#
c997d683 |
|
24-Feb-2024 |
Chengming Zhou <zhouchengming@bytedance.com> |
vfs: remove SLAB_MEM_SPREAD flag usage The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1 (see [1]), so it became a dead flag since the commit 16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the series[1] went on to mark it obsolete explicitly to avoid confusion for users. Here we can just remove all its users, which has no any functional change. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Link: https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz [1] Link: https://lore.kernel.org/r/20240224135315.830477-1-chengming.zhou@linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
7e4a205f |
|
03-Feb-2024 |
Al Viro <viro@zeniv.linux.org.uk> |
Revert "get rid of DCACHE_GENOCIDE" This reverts commit 57851607326a2beef21e67f83f4f53a90df8445a. Unfortunately, while we only call that thing once, the callback *can* be called more than once for the same dentry - all it takes is rename_lock being touched while we are in d_walk(). For now let's revert it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1b6ae9f6 |
|
06-Nov-2023 |
Vegard Nossum <vegard.nossum@oracle.com> |
dcache: remove unnecessary NULL check in dget_dlock() dget_dlock() requires dentry->d_lock to be held when called, yet contains a NULL check for dentry. An audit of all calls to dget_dlock() shows that it is never called with a NULL pointer (as spin_lock()/spin_unlock() would crash in these cases): $ git grep -W '\<dget_dlock\>' arch/powerpc/platforms/cell/spufs/inode.c- spin_lock(&dentry->d_lock); arch/powerpc/platforms/cell/spufs/inode.c- if (simple_positive(dentry)) { arch/powerpc/platforms/cell/spufs/inode.c: dget_dlock(dentry); fs/autofs/expire.c- spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED); fs/autofs/expire.c- if (simple_positive(child)) { fs/autofs/expire.c: dget_dlock(child); fs/autofs/root.c: dget_dlock(active); fs/autofs/root.c- spin_unlock(&active->d_lock); fs/autofs/root.c: dget_dlock(expiring); fs/autofs/root.c- spin_unlock(&expiring->d_lock); fs/ceph/dir.c- if (!spin_trylock(&dentry->d_lock)) fs/ceph/dir.c- continue; [...] fs/ceph/dir.c: dget_dlock(dentry); fs/ceph/mds_client.c- spin_lock(&alias->d_lock); [...] fs/ceph/mds_client.c: dn = dget_dlock(alias); fs/configfs/inode.c- spin_lock(&dentry->d_lock); fs/configfs/inode.c- if (simple_positive(dentry)) { fs/configfs/inode.c: dget_dlock(dentry); fs/libfs.c: found = dget_dlock(d); fs/libfs.c- spin_unlock(&d->d_lock); fs/libfs.c: found = dget_dlock(child); fs/libfs.c- spin_unlock(&child->d_lock); fs/libfs.c: child = dget_dlock(d); fs/libfs.c- spin_unlock(&d->d_lock); fs/ocfs2/dcache.c: dget_dlock(dentry); fs/ocfs2/dcache.c- spin_unlock(&dentry->d_lock); include/linux/dcache.h:static inline struct dentry *dget_dlock(struct dentry *dentry) After taking out the NULL check, dget_dlock() becomes almost identical to __dget_dlock(); the only difference is that dget_dlock() returns the dentry that was passed in. These are static inline helpers, so we can rely on the compiler to discard unused return values. We can therefore also remove __dget_dlock() and replace calls to it by dget_dlock(). Also fix up and improve the kerneldoc comments while we're at it. Al Viro pointed out that we can also clean up some of the callers to make use of the returned value and provided a bit more info for the kerneldoc. While preparing v2 I also noticed that the tabs used in the kerneldoc comments were causing the kerneldoc to get parsed incorrectly so I also fixed this up (including for d_unhashed, which is otherwise unrelated). Testing: x86 defconfig build + boot; make htmldocs for the kerneldoc warning. objdump shows there are code generation changes. Link: https://lore.kernel.org/all/20231022164520.915013-1-vegard.nossum@oracle.com/ Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org Cc: Nick Piggin <npiggin@kernel.dk> Cc: Waiman Long <Waiman.Long@hp.com> Cc: linux-doc@vger.kernel.org Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1b327b5a |
|
10-Nov-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
kill DCACHE_MAY_FREE With the new ordering in __dentry_kill() it has become redundant - it's set if and only if both DCACHE_DENTRY_KILLED and DCACHE_SHRINK_LIST are set. We set it in __dentry_kill(), after having set DCACHE_DENTRY_KILLED with the only condition being that DCACHE_SHRINK_LIST is there; all of that is done without dropping ->d_lock and the only place that checks that flag (shrink_dentry_list()) does so under ->d_lock, after having found the victim on its shrink list. Since DCACHE_SHRINK_LIST is set only when placing dentry into shrink list and removed only by shrink_dentry_list() itself, a check for DCACHE_DENTRY_KILLED in there would be equivalent to check for DCACHE_MAY_FREE. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ef69f050 |
|
23-Nov-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
__d_unalias() doesn't use inode argument ... and hasn't since 2015. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
f9f677c5 |
|
11-Nov-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
d_alloc_parallel(): in-lookup hash insertion doesn't need an RCU variant We only search in the damn thing under hlist_bl_lock(); RCU variant of insertion was, IIRC, pretty much cargo-culted - mea culpa... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
57851607 |
|
12-Nov-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
get rid of DCACHE_GENOCIDE ... now that we never call d_genocide() other than from kill_litter_super() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9024b4c9 |
|
11-Nov-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
d_alloc_pseudo(): move setting ->d_op there from the (sole) caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
f2824db1 |
|
18-Nov-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
kill d_instantate_anon(), fold __d_instantiate_anon() into remaining caller now that the only user of d_instantiate_anon() is gone... [braino fix folded - kudos to Dan Carpenter] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6367b491 |
|
23-Nov-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
retain_dentry(): introduce a trimmed-down lockless variant fast_dput() contains a small piece of code, preceded by scary comments about 5 times longer than it. What is actually done there is a trimmed-down subset of retain_dentry() - in some situations we can tell that retain_dentry() would have returned true without ever needing ->d_lock and that's what that code checks. If these checks come true fast_dput() can declare that we are done, without bothering with ->d_lock; otherwise it has to take the lock and do full variant of retain_dentry() checks. Trimmed-down variant of the checks is hard to follow and it's asking for trouble - if we ever decide to change the rules in retain_dentry(), we'll have to remember to update that code. It turns out that an equivalent variant of these checks more obviously parallel to retain_dentry() is not just possible, but easy to unify with retain_dentry() itself, passing it a new boolean argument ('locked') to distinguish between the full semantics and trimmed down one. Note that in lockless case true is returned only when locked variant would have returned true without ever needing the lock; false means "punt to the locking path and recheck there". Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1c18edd1 |
|
07-Nov-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
__dentry_kill(): new locking scheme Currently we enter __dentry_kill() with parent (along with the victim dentry and victim's inode) held locked. Then we mark dentry refcount as dead call ->d_prune() remove dentry from hash remove it from the parent's list of children unlock the parent, don't need it from that point on detach dentry from inode, unlock dentry and drop the inode (via ->d_iput()) call ->d_release() regain the lock on dentry check if it's on a shrink list (in which case freeing its empty husk has to be left to shrink_dentry_list()) or not (in which case we can free it ourselves). In the former case, mark it as an empty husk, so that shrink_dentry_list() would know it can free the sucker. drop the lock on dentry ... and usually the caller proceeds to drop a reference on the parent, possibly retaking the lock on it. That is painful for a bunch of reasons, starting with the need to take locks out of order, but not limited to that - the parent of positive dentry can change if we drop its ->d_lock, so getting these locks has to be done with care. Moreover, as soon as dentry is out of the parent's list of children, shrink_dcache_for_umount() won't see it anymore, making it appear as if the parent is inexplicably busy. We do work around that by having shrink_dentry_list() decrement the parent's refcount first and put it on shrink list to be evicted once we are done with __dentry_kill() of child, but that may in some cases lead to ->d_iput() on child called after the parent got killed. That doesn't happen in cases where in-tree ->d_iput() instances might want to look at the parent, but that's brittle as hell. Solution: do removal from the parent's list of children in the very end of __dentry_kill(). As the result, the callers do not need to lock the parent and by the time we really need the parent locked, dentry is negative and is guaranteed not to be moved around. It does mean that ->d_prune() will be called with parent not locked. It also means that we might see dentries in process of being torn down while going through the parent's list of children; those dentries will be unhashed, negative and with refcount marked dead. In practice, that's enough for in-tree code that looks through the list of children to do the right thing as-is. Out-of-tree code might need to be adjusted. Calling conventions: __dentry_kill(dentry) is called with dentry->d_lock held, along with ->i_lock of its inode (if any). It either returns the parent (locked, with refcount decremented to 0) or NULL (if there'd been no parent or if refcount decrement for parent hadn't reached 0). lock_for_kill() is adjusted for new requirements - it doesn't touch the parent's ->d_lock at all. Callers adjusted. Note that for dput() we don't need to bother with fast_dput() for the parent - we just need to check retain_dentry() for it, since its ->d_lock is still held since the moment when __dentry_kill() had taken it to remove the victim from the list of children. The kludge with early decrement of parent's refcount in shrink_dentry_list() is no longer needed - shrink_dcache_for_umount() sees the half-killed dentries in the list of children for as long as they are pinning the parent. They are easily recognized and accounted for by select_collect(), so we know we are not done yet. As the result, we always have the expected ordering for ->d_iput()/->d_release() vs. __dentry_kill() of the parent, no exceptions. Moreover, the current rules for shrink lists (one must make sure that shrink_dcache_for_umount() won't happen while any dentries from the superblock in question are on any shrink lists) are gone - shrink_dcache_for_umount() will do the right thing in all cases, taking such dentries out. Their empty husks (memory occupied by struct dentry itself + its external name, if any) will remain on the shrink lists, but they are no obstacles to filesystem shutdown. And such husks will get freed as soon as shrink_dentry_list() of the list they are on gets to them. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b4cc0734 |
|
03-Nov-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
d_prune_aliases(): use a shrink list Instead of dropping aliases one by one, restarting, etc., just collect them into a shrink list and kill them off in one pass. We don't really need the restarts - one alias can't pin another (directory has only one alias, and couldn't be its own ancestor anyway), so collecting everything that is not busy and taking it out would take care of everything evictable that had been there as we entered the function. And new aliases added while we'd been dropping old ones could just as easily have appeared right as we return to caller... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
f5c8a8a4 |
|
30-Oct-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
switch select_collect{,2}() to use of to_shrink_list() Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c2e5e29f |
|
29-Oct-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
to_shrink_list(): call only if refcount is 0 The only thing it does if refcount is not zero is d_lru_del(); no point, IMO, seeing that plain dput() does nothing of that sort... Note that 2 of 3 current callers are guaranteed that refcount is 0. Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
5e7a5c8d |
|
30-Oct-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
fold dentry_kill() into dput() Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
339e9e13 |
|
30-Oct-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
don't try to cut corners in shrink_lock_dentry() That is to say, do *not* treat the ->d_inode or ->d_parent changes as "it's hard, return false; somebody must have grabbed it, so even if has zero refcount, we don't need to bother killing it - final dput() from whoever grabbed it would've done everything". First of all, that is not guaranteed. It might have been dropped by shrink_kill() handling of victim's parent, which would've found it already on a shrink list (ours) and decided that they don't need to put it on their shrink list. What's more, dentry_kill() is doing pretty much the same thing, cutting its own set of corners (it assumes that dentry can't go from positive to negative, so its inode can change but only once and only in one direction). Doing that right allows to get rid of that not-quite-duplication and removes the only reason for re-incrementing refcount before the call of dentry_kill(). Replacement is called lock_for_kill(); called under rcu_read_lock and with ->d_lock held. If it returns false, dentry has non-zero refcount and the same locks are held. If it returns true, dentry has zero refcount and all locks required by __dentry_kill() are taken. Part of __lock_parent() had been lifted into lock_parent() to allow its reuse. Now it's called with rcu_read_lock already held and dentry already unlocked. Note that this is not the final change - locking requirements for __dentry_kill() are going to change later in the series and the set of locks taken by lock_for_kill() will be adjusted. Both lock_parent() and __lock_parent() will be gone once that happens. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
f05441c7 |
|
30-Oct-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
fold the call of retain_dentry() into fast_dput() Calls of retain_dentry() happen immediately after getting false from fast_dput() and getting true from retain_dentry() is treated the same way as non-zero refcount would be treated by fast_dput() - unlock dentry and bugger off. Doing that in fast_dput() itself is simpler. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
2f42f1eb |
|
29-Oct-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
Call retain_dentry() with refcount 0 Instead of bumping it from 0 to 1, calling retain_dentry(), then decrementing it back to 0 (with ->d_lock held all the way through), just leave refcount at 0 through all of that. It will have a visible effect for ->d_delete() - now it can be called with refcount 0 instead of 1 and it can no longer play silly buggers with dropping/regaining ->d_lock. Not that any in-tree instances tried to (it's pretty hard to get right). Any out-of-tree ones will have to adjust (assuming they need any changes). Note that we do not need to extend rcu-critical area here - we have verified that refcount is non-negative after having grabbed ->d_lock, so nobody will be able to free dentry until they get into __dentry_kill(), which won't happen until they manage to grab ->d_lock. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b06c684d |
|
30-Oct-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
dentry_kill(): don't bother with retain_dentry() on slow path We have already checked it and dentry used to look not worthy of keeping. The only hard obstacle to evicting dentry is non-zero refcount; everything else is advisory - e.g. memory pressure could evict any dentry found with refcount zero. On the slow path in dentry_kill() we had dropped and regained ->d_lock; we must recheck the refcount, but everything else is not worth bothering with. Note that filesystem can not count upon ->d_delete() being called for dentry - not even once. Again, memory pressure (as well as d_prune_aliases(), or attempted rmdir() of ancestor, or...) will not call ->d_delete() at all. So from the correctness point of view we are fine doing the check only once. And it makes things simpler down the road. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ee0c8250 |
|
29-Oct-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
__dentry_kill(): get consistent rules for victim's refcount Currently we call it with refcount equal to 1 when called from dentry_kill(); all other callers have it equal to 0. Make it always be called with zero refcount; on this step we just decrement it before the calls in dentry_kill(). That is safe, since all places that care about the value of refcount either do that under ->d_lock or hold a reference to dentry in question. Either is sufficient to prevent observing a dentry immediately prior to __dentry_kill() getting called from dentry_kill(). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e9d130d0 |
|
29-Oct-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
make retain_dentry() neutral with respect to refcounting retain_dentry() used to decrement refcount if and only if it returned true. Lift those decrements into the callers. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6511f6be |
|
29-Oct-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
__dput_to_list(): do decrement of refcount in the callers ... and rename it to to_shrink_list(), seeing that it no longer does dropping any references Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
15f23734 |
|
29-Oct-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
fast_dput(): new rules for refcount By now there is only one place in entire fast_dput() where we return false; that happens after refcount had been decremented and found (under ->d_lock) to be zero. In that case, just prior to returning false to caller, fast_dput() forcibly changes the refcount from 0 to 1. Lift that resetting refcount to 1 into the callers; later in the series it will be massaged out of existence. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
504e08ce |
|
31-Oct-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
fast_dput(): handle underflows gracefully If refcount is less than 1, we should just warn, unlock dentry and return true, so that the caller doesn't try to do anything else. Taking care of that leaves the rest of "lockref_put_return() has failed" case equivalent to "decrement refcount and rejoin the normal slow path after the point where we grab ->d_lock". NOTE: lockref_put_return() is strictly a fastpath thing - unlike the rest of lockref primitives, it does not contain a fallback. Caller (and it looks like fast_dput() is the only legitimate one in the entire kernel) has to do that itself. Reasons for lockref_put_return() failures: * ->d_lock held by somebody * refcount <= 0 * ... or an architecture not supporting lockref use of cmpxchg - sparc, anything non-SMP, config with spinlock debugging... We could add a fallback, but it would be a clumsy API - we'd have to distinguish between: (1) refcount > 1 - decremented, lock not held on return (2) refcount < 1 - left alone, probably no sense to hold the lock (3) refcount is 1, no cmphxcg - decremented, lock held on return (4) refcount is 1, cmphxcg supported - decremented, lock *NOT* held on return. We want to return with no lock held in case (4); that's the whole point of that thing. We very much do not want to have the fallback in case (3) return without a lock, since the caller might have to retake it in that case. So it wouldn't be more convenient than doing the fallback in the caller and it would be very easy to screw up, especially since the test coverage would suck - no way to test (3) and (4) on the same kernel build. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
15220fbf |
|
29-Oct-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
fast_dput(): having ->d_delete() is not reason to delay refcount decrement ->d_delete() is a way for filesystem to tell that dentry is not worth keeping cached. It is not guaranteed to be called every time a dentry has refcount drop down to zero; it is not guaranteed to be called before dentry gets evicted. In other words, it is not suitable for any kind of keeping track of dentry state. None of the in-tree filesystems attempt to use it that way, fortunately. So the contortions done by fast_dput() (as well as dentry_kill()) are not warranted. fast_dput() certainly should treat having ->d_delete() instance as "can't assume we'll be keeping it", but that's not different from the way we treat e.g. DCACHE_DONTCACHE (which is rather similar to making ->d_delete() returns true when called). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
cd9f84f3 |
|
30-Oct-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
shrink_dentry_list(): no need to check that dentry refcount is marked dead ... we won't see DCACHE_MAY_FREE on anything that is *not* dead and checking d_flags is just as cheap as checking refcount. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3fcf5356 |
|
07-Nov-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
centralize killing dentry from shrink list new helper unifying identical bits of shrink_dentry_list() and shring_dcache_for_umount() Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
da549bdd |
|
07-Nov-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
dentry: switch the lists of children to hlist Saves a pointer per struct dentry and actually makes the things less clumsy. Cleaned the d_walk() and dcache_readdir() a bit by use of hlist_for_... iterators. A couple of new helpers - d_first_child() and d_next_sibling(), to make the expressions less awful. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8219cb58 |
|
10-Nov-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
kill d_{is,set}_fallthru() Introduced in 2015 and never had any in-tree users... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6d73c9ce |
|
29-Oct-2023 |
Al Viro <viro@zeniv.linux.org.uk> |
get rid of __dget() fold into the sole remaining caller Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9d5b9475 |
|
20-Nov-2023 |
Joel Granados <j.granados@samsung.com> |
fs: Remove the now superfluous sentinel elements from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel elements ctl_table struct. Special attention was placed in making sure that an empty directory for fs/verity was created when CONFIG_FS_VERITY_BUILTIN_SIGNATURES is not defined. In this case we use the register sysctl call that expects a size. Signed-off-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
#
0a97c01c |
|
30-Nov-2023 |
Nhat Pham <nphamcs@gmail.com> |
list_lru: allow explicit memcg and NUMA node selection Patch series "workload-specific and memory pressure-driven zswap writeback", v8. There are currently several issues with zswap writeback: 1. There is only a single global LRU for zswap, making it impossible to perform worload-specific shrinking - an memcg under memory pressure cannot determine which pages in the pool it owns, and often ends up writing pages from other memcgs. This issue has been previously observed in practice and mitigated by simply disabling memcg-initiated shrinking: https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u But this solution leaves a lot to be desired, as we still do not have an avenue for an memcg to free up its own memory locked up in the zswap pool. 2. We only shrink the zswap pool when the user-defined limit is hit. This means that if we set the limit too high, cold data that are unlikely to be used again will reside in the pool, wasting precious memory. It is hard to predict how much zswap space will be needed ahead of time, as this depends on the workload (specifically, on factors such as memory access patterns and compressibility of the memory pages). This patch series solves these issues by separating the global zswap LRU into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The new shrinker does not have any parameter that must be tuned by the user, and can be opted in or out on a per-memcg basis. As a proof of concept, we ran the following synthetic benchmark: build the linux kernel in a memory-limited cgroup, and allocate some cold data in tmpfs to see if the shrinker could write them out and improved the overall performance. Depending on the amount of cold data generated, we observe from 14% to 35% reduction in kernel CPU time used in the kernel builds. This patch (of 6): The interface of list_lru is based on the assumption that the list node and the data it represents belong to the same allocated on the correct node/memcg. While this assumption is valid for existing slab objects LRU such as dentries and inodes, it is undocumented, and rather inflexible for certain potential list_lru users (such as the upcoming zswap shrinker and the THP shrinker). It has caused us a lot of issues during our development. This patch changes list_lru interface so that the caller must explicitly specify numa node and memcg when adding and removing objects. The old list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and list_lru_del_obj(), respectively. It also extends the list_lru API with a new function, list_lru_putback, which undoes a previous list_lru_isolate call. Unlike list_lru_add, it does not increment the LRU node count (as list_lru_isolate does not decrement the node count). list_lru_putback also allows for explicit memcg and NUMA node selection. Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Chris Li <chrisl@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
68279f9c |
|
11-Oct-2023 |
Alexey Dobriyan <adobriyan@gmail.com> |
treewide: mark stuff as __ro_after_init __read_mostly predates __ro_after_init. Many variables which are marked __read_mostly should have been __ro_after_init from day 1. Also, mark some stuff as "const" and "__init" while I'm at it. [akpm@linux-foundation.org: revert sysctl_nr_open_min, sysctl_nr_open_max changes due to arm warning] [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/4f6bb9c0-abba-4ee4-a7aa-89265e886817@p183 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
771eb4fe |
|
09-Jul-2018 |
Kent Overstreet <kent.overstreet@gmail.com> |
fs: factor out d_mark_tmpfile() New helper for bcachefs - bcachefs doesn't want the inode_dec_link_count() call that d_tmpfile does, it handles i_nlink on its own atomically with other btree updates Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christian Brauner <brauner@kernel.org>
|
#
8c8e7dba |
|
17-Aug-2023 |
Anh Tuan Phan <tuananhlfc@gmail.com> |
fs/dcache: Replace printk and WARN_ON by WARN Use WARN instead of printk + WARN_ON as reported from coccinelle: ./fs/dcache.c:1667:1-7: SUGGESTION: printk + WARN_ON can be just WARN Signed-off-by: Anh Tuan Phan <tuananhlfc@gmail.com> Message-Id: <20230817163142.117706-1-tuananhlfc@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
4352b8cd |
|
08-Aug-2023 |
Christoph Hellwig <hch@lst.de> |
fs: unexport d_genocide d_genocide is only used by built-in code. Signed-off-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230808161704.1099680-1-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
863f144f |
|
23-Sep-2022 |
Miklos Szeredi <mszeredi@redhat.com> |
vfs: open inside ->tmpfile() This is in preparation for adding tmpfile support to fuse, which requires that the tmpfile creation and opening are done as a single operation. Replace the 'struct dentry *' argument of i_op->tmpfile with 'struct file *'. Call finish_open_simple() as the last thing in ->tmpfile() instances (may be omitted in the error case). Change d_tmpfile() argument to 'struct file *' as well to make callers more readable. Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
#
93f6d4e1 |
|
25-Aug-2022 |
Thomas Gleixner <tglx@linutronix.de> |
dentry: Use preempt_[dis|en]able_nested() Replace the open coded CONFIG_PREEMPT_RT conditional preempt_disable/enable() with the new helper. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/r/20220825164131.402717-3-bigeasy@linutronix.de
|
#
ae2a8236 |
|
11-Aug-2022 |
Linus Torvalds <torvalds@linux-foundation.org> |
dcache: move the DCACHE_OP_COMPARE case out of the __d_lookup_rcu loop __d_lookup_rcu() is one of the hottest functions in the kernel on certain loads, and it is complicated by filesystems that might want to have their own name compare function. We can improve code generation by moving the test of DCACHE_OP_COMPARE outside the loop, which makes the loop itself much simpler, at the cost of some code duplication. But both cases end up being simpler, and the "native" direct case-sensitive compare particularly so. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4f48d5da |
|
15-May-2022 |
Xiubo Li <xiubli@redhat.com> |
fs/dcache: export d_same_name() helper Compare dentry name with case-exact name, return true if names are same, or false. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
50417d22 |
|
27-Jul-2022 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
fs/dcache: Move wakeup out of i_seq_dir write held region. __d_add() and __d_move() wake up waiters on dentry::d_wait from within the i_seq_dir write held region. This violates the PREEMPT_RT constraints as the wake up acquires wait_queue_head::lock which is a "sleeping" spinlock on RT. There is no requirement to do so. __d_lookup_unhash() has cleared DCACHE_PAR_LOOKUP and dentry::d_wait and returned the now unreachable wait queue head pointer to the caller, so the actual wake up can be postponed until the i_dir_seq write side critical section is left. The only requirement is that dentry::lock is held across the whole sequence including the wake up. The previous commit includes an analysis why this is considered safe. Move the wake up past end_dir_add() which leaves the i_dir_seq write side critical section and enables preemption. For non RT kernels there is no difference because preemption is still disabled due to dentry::lock being held, but it shortens the time between wake up and unlocking dentry::lock, which reduces the contention for the woken up waiter. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
45f78b0a |
|
27-Jul-2022 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
fs/dcache: Move the wakeup from __d_lookup_done() to the caller. __d_lookup_done() wakes waiters on dentry->d_wait. On PREEMPT_RT we are not allowed to do that with preemption disabled, since the wakeup acquired wait_queue_head::lock, which is a "sleeping" spinlock on RT. Calling it under dentry->d_lock is not a problem, since that is also a "sleeping" spinlock on the same configs. Unfortunately, two of its callers (__d_add() and __d_move()) are holding more than just ->d_lock and that needs to be dealt with. The key observation is that wakeup can be moved to any point before dropping ->d_lock. As a first step to solve this, move the wake up outside of the hlist_bl_lock() held section. This is safe because: Waiters get inserted into ->d_wait only after they'd taken ->d_lock and observed DCACHE_PAR_LOOKUP in flags. As long as they are woken up (and evicted from the queue) between the moment __d_lookup_done() has removed DCACHE_PAR_LOOKUP and dropping ->d_lock, we are safe, since the waitqueue ->d_wait points to won't get destroyed without having __d_lookup_done(dentry) called (under ->d_lock). ->d_wait is set only by d_alloc_parallel() and only in case when it returns a freshly allocated in-lookup dentry. Whenever that happens, we are guaranteed that __d_lookup_done() will be called for resulting dentry (under ->d_lock) before the wq in question gets destroyed. With two exceptions wq lives in call frame of the caller of d_alloc_parallel() and we have an explicit d_lookup_done() on the resulting in-lookup dentry before we leave that frame. One of those exceptions is nfs_call_unlink(), where wq is embedded into (dynamically allocated) struct nfs_unlinkdata. It is destroyed in nfs_async_unlink_release() after an explicit d_lookup_done() on the dentry wq went into. Remaining exception is d_add_ci(). There wq is what we'd found in ->d_wait of d_add_ci() argument. Callers of d_add_ci() are two instances of ->d_lookup() and they must have been given an in-lookup dentry. Which means that they'd been called by __lookup_slow() or lookup_open(), with wq in the call frame of one of those. Result of d_alloc_parallel() in d_add_ci() is fed to d_splice_alias(), which either returns non-NULL (and d_add_ci() does d_lookup_done()) or feeds dentry to __d_add() that will do __d_lookup_done() under ->d_lock. That concludes the analysis. Let __d_lookup_unhash(): 1) Lock the lookup hash and clear DCACHE_PAR_LOOKUP 2) Unhash the dentry 3) Retrieve and clear dentry::d_wait 4) Unlock the hash and return the retrieved waitqueue head pointer 5) Let the caller handle the wake up. 6) Rename __d_lookup_done() to __d_lookup_unhash_wake() to enforce build failures for OOT code that used __d_lookup_done() and is not aware of the new return value. This does not yet solve the PREEMPT_RT problem completely because preemption is still disabled due to i_dir_seq being held for write. This will be addressed in subsequent steps. An alternative solution would be to switch the waitqueue to a simple waitqueue, but aside of Linus not being a fan of them, moving the wake up closer to the place where dentry::lock is unlocked reduces lock contention time for the woken up waiter. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20220613140712.77932-3-bigeasy@linutronix.de Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
cf634d54 |
|
27-Jul-2022 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT i_dir_seq is a sequence counter with a lock which is represented by the lowest bit. The writer atomically updates the counter which ensures that it can be modified by only one writer at a time. This requires preemption to be disabled across the write side critical section. On !PREEMPT_RT kernels this is implicit by the caller acquiring dentry::lock. On PREEMPT_RT kernels spin_lock() does not disable preemption which means that a preempting writer or reader would live lock. It's therefore required to disable preemption explicitly. An alternative solution would be to replace i_dir_seq with a seqlock_t for PREEMPT_RT, but that comes with its own set of problems due to arbitrary lock nesting. A pure sequence count with an associated spinlock is not possible because the locks held by the caller are not necessarily related. As the critical section is small, disabling preemption is a sensible solution. Reported-by: Oleg.Karfich@wago.com Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20220613140712.77932-2-bigeasy@linutronix.de Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
40a3cb0d |
|
29-Jul-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
d_add_ci(): make sure we don't miss d_lookup_done() All callers of d_alloc_parallel() must make sure that resulting in-lookup dentry (if any) will encounter __d_lookup_done() before the final dput(). d_add_ci() might end up creating in-lookup dentries; they are fed to d_splice_alias(), which will normally make sure they meet __d_lookup_done(). However, it is possible to end up with d_splice_alias() failing with ERR_PTR(-ELOOP) without having done so. It takes a corrupted ntfs or case-insensitive xfs image, but neither should end up with memory corruption... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
f53bf711 |
|
22-Mar-2022 |
Muchun Song <songmuchun@bytedance.com> |
mm: dcache: use kmem_cache_alloc_lru() to allocate dentry Like inode cache, the dentry will also be added to its memcg list_lru. So replace kmem_cache_alloc() with kmem_cache_alloc_lru() to allocate dentry. Link: https://lkml.kernel.org/r/20220228122126.37293-8-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c8c0c239 |
|
21-Jan-2022 |
Luis Chamberlain <mcgrof@kernel.org> |
fs: move dcache sysctls to its own file kernel/sysctl.c is a kitchen sink where everyone leaves their dirty dishes, this makes it very difficult to maintain. To help with this maintenance let's start by moving sysctls to places where they actually belong. The proc sysctl maintainers do not want to know what sysctl knobs you wish to add for your own piece of code, we just care about the core logic. So move the dcache sysctl clutter out of kernel/sysctl.c. This is a small one-off entry, perhaps later we can simplify this representation, but for now we use the helpers we have. We won't know how we can simplify this further untl we're fully done with the cleanup. [arnd@arndb.de: avoid unused-function warning] Link: https://lkml.kernel.org/r/20211203190123.874239-2-arnd@kernel.org Link: https://lkml.kernel.org/r/20211129205548.605569-4-mcgrof@kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Antti Palosaari <crope@iki.fi> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Lukas Middendorf <kernel@tuxforce.de> Cc: Stephen Kitt <steve@sk2.org> Cc: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
80e5d1ff5 |
|
15-Apr-2021 |
Al Viro <viro@zeniv.linux.org.uk> |
useful constants: struct qstr for ".." Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3d742d4b |
|
24-Feb-2021 |
Randy Dunlap <rdunlap@infradead.org> |
fs: delete repeated words in comments Delete duplicate words in fs/*.c. The doubled words that are being dropped are: that, be, the, in, and, for Link: https://lkml.kernel.org/r/20201224052810.25315-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
961f3c89 |
|
14-Jan-2021 |
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> |
fs: fix kernel-doc markups Two markups are at the wrong place. Kernel-doc only support having the comment just before the identifier. Also, some identifiers have different names between their prototypes and the kernel-doc markup. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lore.kernel.org/r/96b1e1b388600ab092331f6c4e88ff8e8779ce6c.1610610937.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
#
bca585d2 |
|
05-Jan-2021 |
Al Viro <viro@zeniv.linux.org.uk> |
new helper: d_find_alias_rcu() similar to d_find_alias(inode), except that * the caller must be holding rcu_read_lock() * inode must not be freed until matching rcu_read_unlock() * result is *NOT* pinned and can only be dereferenced until the matching rcu_read_unlock(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
77573fa3 |
|
07-Dec-2020 |
Hao Li <lihao2018.fnst@cn.fujitsu.com> |
fs: Kill DCACHE_DONTCACHE dentry even if DCACHE_REFERENCED is set If DCACHE_REFERENCED is set, fast_dput() will return true, and then retain_dentry() have no chance to check DCACHE_DONTCACHE. As a result, the dentry won't be killed and the corresponding inode can't be evicted. In the following example, the DAX policy can't take effects unless we do a drop_caches manually. # DCACHE_LRU_LIST will be set echo abcdefg > test.txt # DCACHE_REFERENCED will be set and DCACHE_DONTCACHE can't do anything xfs_io -c 'chattr +x' test.txt # Drop caches to make DAX changing take effects echo 2 > /proc/sys/vm/drop_caches What this patch does is preventing fast_dput() from returning true if DCACHE_DONTCACHE is set. Then retain_dentry() will detect the DCACHE_DONTCACHE and will return false. As a result, the dentry will be killed and the inode will be evicted. In this way, if we change per-file DAX policy, it will take effects automatically after this file is closed by all processes. I also add some comments to make the code more clear. Signed-off-by: Hao Li <lihao2018.fnst@cn.fujitsu.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
26475371 |
|
20-Jul-2020 |
Ahmed S. Darwish <a.darwish@linutronix.de> |
vfs: Use sequence counter with associated spinlock A sequence counter write side critical section must be protected by some form of locking to serialize writers. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_spinlock_t data type, which allows to associate a spinlock with the sequence counter. This enables lockdep to verify that the spinlock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled this lock association is compiled out and has neither storage size nor runtime overhead. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200720155530.1173732-19-a.darwish@linutronix.de
|
#
2c567af4 |
|
30-Apr-2020 |
Ira Weiny <ira.weiny@intel.com> |
fs: Introduce DCACHE_DONTCACHE DCACHE_DONTCACHE indicates a dentry should not be cached on final dput(). Also add a helper function to mark DCACHE_DONTCACHE on all dentries pointing to a specific inode when that inode is being set I_DONTCACHE. This facilitates dropping dentry references to inodes sooner which require eviction to swap S_DAX mode. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
32927393 |
|
24-Apr-2020 |
Christoph Hellwig <hch@lst.de> |
sysctl: pass kernel pointers to ->proc_handler Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from userspace in common code. This also means that the strings are always NUL-terminated by the common code, making the API a little bit safer. As most handler just pass through the data to one of the common handlers a lot of the changes are mechnical. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
2fa6b1e0 |
|
12-Nov-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
fs/namei.c: fix missing barriers when checking positivity Pinned negative dentries can, generally, be made positive by another thread. Conditions that prevent that are * ->d_lock on dentry in question * parent directory held at least shared * nobody else could have observed the address of dentry Most of the places working with those fall into one of those categories; however, d_lookup() and friends need to be used with some care. Fortunately, there's not a lot of call sites, and with few exceptions all of those fall under one of the cases above. Exceptions are all in fs/namei.c - in lookup_fast(), lookup_dcache() and mountpoint_last(). Another one is lookup_slow() - there dcache lookup is done with parent held shared, but the result is used after we'd drop the lock. The same happens in do_last() - the lookup (in lookup_one()) is done with parent locked, but result is used after unlocking. lookup_fast(), do_last() and mountpoint_last() flat-out reject negatives. Most of lookup_dcache() calls are made with parent locked at least shared; the only exception is lookup_one_len_unlocked(). It might return pinned negative, needs serious care from callers. Fortunately, almost nobody calls it directly anymore; all but two callers have converted to lookup_positive_unlocked(), which rejects negatives. lookup_slow() is called by the same lookup_one_len_unlocked() (see above), mountpoint_last() and walk_component(). In those two negatives are rejected. In other words, there is a small set of places where we need to check carefully if a pinned potentially negative dentry is, in fact, positive. After that check we want to be sure that both ->d_inode and type bits in ->d_flags are stable and observed. The set consists of follow_managed() (where the rejection happens for lookup_fast(), walk_component() and do_last()), last_mountpoint() and lookup_positive_unlocked(). Solution: 1) transition from negative to positive (in __d_set_inode_and_type()) stores ->d_inode, then uses smp_store_release() to set ->d_flags type bits. 2) aforementioned 3 places in fs/namei.c fetch ->d_flags with smp_load_acquire() and bugger off if it type bits say "negative". That way anyone downstream of those checks has dentry know positive pinned, with ->d_inode and type bits of ->d_flags stable and observed. I considered splitting off d_lookup_positive(), so that the checks could be done right there, under ->d_lock. However, that leads to massive duplication of rather subtle code in fs/namei.c and fs/dcache.c. It's worse than it might seem, thanks to autofs ->d_manage() getting involved ;-/ No matter what, autofs_d_manage()/autofs_d_automount() must live with the possibility of pinned negative dentry passed their way, becoming positive under them - that's the intended behaviour when lookup comes in the middle of automount in progress, so we can't keep them out of the area that has to deal with those, more's the pity... Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e8400933 |
|
30-Oct-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
fix dget_parent() fastpath race We are overoptimistic about taking the fast path there; seeing the same value in ->d_parent after having grabbed a reference to that parent does *not* mean that it has remained our parent all along. That wouldn't be a big deal (in the end it is our parent and we have grabbed the reference we are about to return), but... the situation with barriers is messed up. We might have hit the following sequence: d is a dentry of /tmp/a/b CPU1: CPU2: parent = d->d_parent (i.e. dentry of /tmp/a) rename /tmp/a/b to /tmp/b rmdir /tmp/a, making its dentry negative grab reference to parent, end up with cached parent->d_inode (NULL) mkdir /tmp/a, rename /tmp/b to /tmp/a/b recheck d->d_parent, which is back to original decide that everything's fine and return the reference we'd got. The trouble is, caller (on CPU1) will observe dget_parent() returning an apparently negative dentry. It actually is positive, but CPU1 has stale ->d_inode cached. Use d->d_seq to see if it has been moved instead of rechecking ->d_parent. NOTE: we are *NOT* going to retry on any kind of ->d_seq mismatch; we just go into the slow path in such case. We don't wait for ->d_seq to become even either - again, if we are racing with renames, we can bloody well go to slow path anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
5c8b0dfc |
|
25-Oct-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
make __d_alloc() static no users outside of fs/dcache.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
5facae4f |
|
18-Sep-2019 |
Qian Cai <cai@lca.pw> |
locking/lockdep: Remove unused @nested argument from lock_release() Since the following commit: b4adfe8e05f1 ("locking/lockdep: Remove unused argument in __lock_release") @nested is no longer used in lock_release(), so remove it from all lock_release() calls and friends. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: airlied@linux.ie Cc: akpm@linux-foundation.org Cc: alexander.levin@microsoft.com Cc: daniel@iogearbox.net Cc: davem@davemloft.net Cc: dri-devel@lists.freedesktop.org Cc: duyuyang@gmail.com Cc: gregkh@linuxfoundation.org Cc: hannes@cmpxchg.org Cc: intel-gfx@lists.freedesktop.org Cc: jack@suse.com Cc: jlbec@evilplan.or Cc: joonas.lahtinen@linux.intel.com Cc: joseph.qi@linux.alibaba.com Cc: jslaby@suse.com Cc: juri.lelli@redhat.com Cc: maarten.lankhorst@linux.intel.com Cc: mark@fasheh.com Cc: mhocko@kernel.org Cc: mripard@kernel.org Cc: ocfs2-devel@oss.oracle.com Cc: rodrigo.vivi@intel.com Cc: sean@poorly.run Cc: st@kernel.org Cc: tj@kernel.org Cc: tytso@mit.edu Cc: vdavydov.dev@gmail.com Cc: vincent.guittot@linaro.org Cc: viro@zeniv.linux.org.uk Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
9bdebc2b |
|
29-Jun-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
Teach shrink_dcache_parent() to cope with mixed-filesystem shrink lists Currently, running into a shrink list that contains dentries from different filesystems can cause several unpleasant things for shrink_dcache_parent() and for umount(2). The first problem is that there's a window during shrink_dentry_list() between __dentry_kill() takes a victim out and dropping reference to its parent. During that window the parent looks like a genuine busy dentry. shrink_dcache_parent() (or, worse yet, shrink_dcache_for_umount()) coming at that time will see no eviction candidates and no indication that it needs to wait for some shrink_dentry_list() to proceed further. That applies for any shrink list that might intersect with the subtree we are trying to shrink; the only reason it does not blow on umount(2) in the mainline is that we unregister the memory shrinker before hitting shrink_dcache_for_umount(). Another problem happens if something in a mixed-filesystem shrink list gets be stuck in e.g. iput(), getting umount of unrelated fs to spin waiting for the stuck shrinker to get around to our dentries. Solution: 1) have shrink_dentry_list() decrement the parent's refcount and make sure it's on a shrink list (ours unless it already had been on some other) before calling __dentry_kill(). That eliminates the window when shrink_dcache_parent() would've blown past the entire subtree without noticing anything with zero refcount not on shrink lists. 2) when shrink_dcache_parent() has found no eviction candidates, but some dentries are still sitting on shrink lists, rather than repeating the scan in hope that shrinkers have progressed, scan looking for something on shrink lists with zero refcount. If such a thing is found, grab rcu_read_lock() and stop the scan, with caller locking it for eviction, dropping out of RCU and doing __dentry_kill(), with the same treatment for parent as shrink_dentry_list() would do. Note that right now mixed-filesystem shrink lists do not occur, so this is not a mainline bug. Howevere, there's a bunch of uses for such beasts (e.g. the "try and evict everything we can out of given page" patches; there are potential uses in mount-related code, considerably simplifying the life in fs/namespace.c, etc.) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
49246466 |
|
26-May-2019 |
Amir Goldstein <amir73il@gmail.com> |
fsnotify: move fsnotify_nameremove() hook out of d_delete() d_delete() was piggy backed for the fsnotify_nameremove() hook when in fact not all callers of d_delete() care about fsnotify events. For all callers of d_delete() that may be interested in fsnotify events, we made sure to call one of fsnotify_{unlink,rmdir}() hooks before calling d_delete(). Now we can move the fsnotify_nameremove() call from d_delete() to the fsnotify_{unlink,rmdir}() hooks. Two explicit calls to fsnotify_nameremove() from nfs/afs sillyrename are also removed. This will cause a change of behavior - nfs/afs will NOT generate an fsnotify delete event when renaming over a positive dentry. This change is desirable, because it is consistent with the behavior of all other filesystems. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
|
#
457c8996 |
|
19-May-2019 |
Thomas Gleixner <tglx@linutronix.de> |
treewide: Add SPDX license identifier for missed files Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
230c6402 |
|
26-Apr-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
ovl_lookup_real_one(): don't bother with strlen() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
0bf3d5c1 |
|
20-Mar-2019 |
Eric Biggers <ebiggers@google.com> |
fs, fscrypt: clear DCACHE_ENCRYPTED_NAME when unaliasing directory Make __d_move() clear DCACHE_ENCRYPTED_NAME on the source dentry. This is needed for when d_splice_alias() moves a directory's encrypted alias to its decrypted alias as a result of the encryption key being added. Otherwise, the decrypted alias will incorrectly be invalidated on the next lookup, causing problems such as unmounting a mount the user just mount()ed there. Note that we don't have to support arbitrary moves of this flag because fscrypt doesn't allow dentries with DCACHE_ENCRYPTED_NAME to be the source or target of a rename(). Fixes: 28b4c263961c ("ext4 crypto: revalidate dentry after adding or removing the key") Reported-by: Sarthak Kukreti <sarthakkukreti@chromium.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
ab1152dd |
|
15-Mar-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
unexport d_alloc_pseudo() No modular uses since introducion of alloc_file_pseudo(), and the only non-modular user not in alloc_file_pseudo() had actually been wrong - should've been d_alloc_anon(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
5467a68c |
|
15-Mar-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
dcache: sort the freeing-without-RCU-delay mess for good. For lockless accesses to dentries we don't have pinned we rely (among other things) upon having an RCU delay between dropping the last reference and actually freeing the memory. On the other hand, for things like pipes and sockets we neither do that kind of lockless access, nor want to deal with the overhead of an RCU delay every time a socket gets closed. So delay was made optional - setting DCACHE_RCUACCESS in ->d_flags made sure it would happen. We tried to avoid setting it unless we knew we need it. Unfortunately, that had led to recurring class of bugs, in which we missed the need to set it. We only really need it for dentries that are created by d_alloc_pseudo(), so let's not bother with trying to be smart - just make having an RCU delay the default. The ones that do *not* get it set the replacement flag (DCACHE_NORCU) and we'd better use that sparingly. d_alloc_pseudo() is the only such user right now. FWIW, the race that finally prompted that switch had been between __lock_parent() of immediate subdirectory of what's currently the root of a disconnected tree (e.g. from open-by-handle in progress) racing with d_splice_alias() elsewhere picking another alias for the same inode, either on outright corrupted fs image, or (in case of open-by-handle on NFS) that subdirectory having been just moved on server. It's not easy to hit, so the sky is not falling, but that's not the first race on similar missed cases and the logics for settinf DCACHE_RCUACCESS has gotten ridiculously convoluted. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
af0c9af1 |
|
30-Jan-2019 |
Waiman Long <longman@redhat.com> |
fs/dcache: Track & report number of negative dentries The current dentry number tracking code doesn't distinguish between positive & negative dentries. It just reports the total number of dentries in the LRU lists. As excessive number of negative dentries can have an impact on system performance, it will be wise to track the number of positive and negative dentries separately. This patch adds tracking for the total number of negative dentries in the system LRU lists and reports it in the 5th field in the /proc/sys/fs/dentry-state file. The number, however, does not include negative dentries that are in flight but not in the LRU yet as well as those in the shrinker lists which are on the way out anyway. The number of positive dentries in the LRU lists can be roughly found by subtracting the number of negative dentries from the unused count. Matthew Wilcox had confirmed that since the introduction of the dentry_stat structure in 2.1.60, the dummy array was there, probably for future extension. They were not replacements of pre-existing fields. So no sane applications that read the value of /proc/sys/fs/dentry-state will do dummy thing if the last 2 fields of the sysctl parameter are not zero. IOW, it will be safe to use one of the dummy array entry for negative dentry count. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1dbd449c |
|
30-Jan-2019 |
Waiman Long <longman@redhat.com> |
fs/dcache: Fix incorrect nr_dentry_unused accounting in shrink_dcache_sb() The nr_dentry_unused per-cpu counter tracks dentries in both the LRU lists and the shrink lists where the DCACHE_LRU_LIST bit is set. The shrink_dcache_sb() function moves dentries from the LRU list to a shrink list and subtracts the dentry count from nr_dentry_unused. This is incorrect as the nr_dentry_unused count will also be decremented in shrink_dentry_list() via d_shrink_del(). To fix this double decrement, the decrement in the shrink_dcache_sb() function is taken out. Fixes: 4e717f5c1083 ("list_lru: remove special case function list_lru_dispose_all." Cc: stable@kernel.org Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
57c8a661 |
|
30-Oct-2018 |
Mike Rapoport <rppt@linux.vnet.ibm.com> |
mm: remove include/linux/bootmem.h Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header. The includes were replaced with the semantic patch below and then semi-automated removal of duplicated '#include <linux/memblock.h> @@ @@ - #include <linux/bootmem.h> + #include <linux/memblock.h> [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal] Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2e03b4bc |
|
26-Oct-2018 |
Vlastimil Babka <vbabka@suse.cz> |
dcache: allocate external names from reclaimable kmalloc caches We can use the newly introduced kmalloc-reclaimable-X caches, to allocate external names in dcache, which will take care of the proper accounting automatically, and also improve anti-fragmentation page grouping. This effectively reverts commit f1782c9bc547 ("dcache: account external names as indirectly reclaimable memory") and instead passes __GFP_RECLAIMABLE to kmalloc(). The accounting thus moves from NR_INDIRECTLY_RECLAIMABLE_BYTES to NR_SLAB_RECLAIMABLE, which is also considered in MemAvailable calculation and overcommit decisions. Link: http://lkml.kernel.org/r/20180731090649.16028-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Roman Gushchin <guro@fb.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6cd00a01 |
|
17-Aug-2018 |
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> |
fs/dcache.c: fix kmemcheck splat at take_dentry_name_snapshot() Since only dentry->d_name.len + 1 bytes out of DNAME_INLINE_LEN bytes are initialized at __d_alloc(), we can't copy the whole size unconditionally. WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff8fa27465ac50) 636f6e66696766732e746d70000000000010000000000000020000000188ffff i i i i i i i i i i i i i u u u u u u u u u u i i i i i u u u u ^ RIP: 0010:take_dentry_name_snapshot+0x28/0x50 RSP: 0018:ffffa83000f5bdf8 EFLAGS: 00010246 RAX: 0000000000000020 RBX: ffff8fa274b20550 RCX: 0000000000000002 RDX: ffffa83000f5be40 RSI: ffff8fa27465ac50 RDI: ffffa83000f5be60 RBP: ffffa83000f5bdf8 R08: ffffa83000f5be48 R09: 0000000000000001 R10: ffff8fa27465ac00 R11: ffff8fa27465acc0 R12: ffff8fa27465ac00 R13: ffff8fa27465acc0 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f79737ac8c0(0000) GS:ffffffff8fc30000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8fa274c0b000 CR3: 0000000134aa7002 CR4: 00000000000606f0 take_dentry_name_snapshot+0x28/0x50 vfs_rename+0x128/0x870 SyS_rename+0x3b2/0x3d0 entry_SYSCALL_64_fastpath+0x1a/0xa4 0xffffffffffffffff Link: http://lkml.kernel.org/r/201709131912.GBG39012.QMJLOVFSFFOOtH@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4c0d7cd5 |
|
09-Aug-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
make sure that __dentry_kill() always invalidates d_seq, unhashed or not RCU pathwalk relies upon the assumption that anything that changes ->d_inode of a dentry will invalidate its ->d_seq. That's almost true - the one exception is that the final dput() of already unhashed dentry does *not* touch ->d_seq at all. Unhashing does, though, so for anything we'd found by RCU dcache lookup we are fine. Unfortunately, we can *start* with an unhashed dentry or jump into it. We could try and be careful in the (few) places where that could happen. Or we could just make the final dput() invalidate the damn thing, unhashed or not. The latter is much simpler and easier to backport, so let's do it that way. Reported-by: "Dae R. Jeong" <threeearcat@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
90bad5e0 |
|
06-Aug-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
root dentries need RCU-delayed freeing Since mountpoint crossing can happen without leaving lazy mode, root dentries do need the same protection against having their memory freed without RCU delay as everything else in the tree. It's partially hidden by RCU delay between detaching from the mount tree and dropping the vfsmount reference, but the starting point of pathwalk can be on an already detached mount, in which case umount-caused RCU delay has already passed by the time the lazy pathwalk grabs rcu_read_lock(). If the starting point happens to be at the root of that vfsmount *and* that vfsmount covers the entire filesystem, we get trouble. Fixes: 48a066e72d97 ("RCU'd vsfmounts") Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7964410f |
|
01-Aug-2018 |
Gustavo A. R. Silva <gustavo@embeddedor.com> |
fs: dcache: Use true and false for boolean values Return statements in functions returning bool should use true or false instead of an integer value. This issue was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c2b6d621 |
|
28-Jun-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
new primitive: discard_new_inode() We don't want open-by-handle picking half-set-up in-core struct inode from e.g. mkdir() having failed halfway through. In other words, we don't want such inodes returned by iget_locked() on their way to extinction. However, we can't just have them unhashed - otherwise open-by-handle immediately *after* that would've ended up creating a new in-core inode over the on-disk one that is in process of being freed right under us. Solution: new flag (I_CREATING) set by insert_inode_locked() and removed by unlock_new_inode() and a new primitive (discard_new_inode()) to be used by such halfway-through-setup failure exits instead of unlock_new_inode() / iput() combinations. That primitive unlocks new inode, but leaves I_CREATING in place. iget_locked() treats finding an I_CREATING inode as failure (-ESTALE, once we sort out the error propagation). insert_inode_locked() treats the same as instant -EBUSY. ilookup() treats those as icache miss. [Fix by Dan Carpenter <dan.carpenter@oracle.com> folded in] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c971e6a0 |
|
28-May-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
kill d_instantiate_no_diralias() The only user is fuse_create_new_entry(), and there it's used to mitigate the same mkdir/open-by-handle race as in nfs_mkdir(). The same solution applies - unhash the mkdir argument, then call d_splice_alias() and if that returns a reference to preexisting alias, dput() and report success. ->mkdir() argument left unhashed negative with the preexisting alias moved in the right place is just fine from the ->mkdir() callers point of view. Cc: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
63a67a92 |
|
23-Jun-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
kill dentry_update_name_case() the last user is gone Spotted-by: Richard Weinberger <richard@nod.at> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
61fec493 |
|
25-Apr-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
get rid of dead code in d_find_alias() All "try disconnected alias if nothing else fits" logics in d_find_alias() got accidentally disabled by Neil a while ago; for most of the callers it was the right thing to do, so fixes belong in few callers that *do* want disconnected aliases. This just takes the now-dead code in d_find_alias() out. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1e2e547a |
|
04-May-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
do d_instantiate/unlock_new_inode combinations safely For anything NFS-exported we do _not_ want to unlock new inode before it has grown an alias; original set of fixes got the ordering right, but missed the nasty complication in case of lockdep being enabled - unlock_new_inode() does lockdep_annotate_inode_mutex_key(inode) which can only be done before anyone gets a chance to touch ->i_mutex. Unfortunately, flipping the order and doing unlock_new_inode() before d_instantiate() opens a window when mkdir can race with open-by-fhandle on a guessed fhandle, leading to multiple aliases for a directory inode and all the breakage that follows from that. Correct solution: a new primitive (d_instantiate_new()) combining these two in the right order - lockdep annotate, then d_instantiate(), then the rest of unlock_new_inode(). All combinations of d_instantiate() with unlock_new_inode() should be converted to that. Cc: stable@kernel.org # 2.6.29 and later Tested-by: Mike Marshall <hubcap@omnibond.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4fb48871 |
|
19-Apr-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
restore cond_resched() in shrink_dcache_parent() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1088a640 |
|
15-Apr-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
dput(): turn into explicit while() loop No need to mess with gotos when the code yielded by straight while() isn't any worse... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9c5f1d30 |
|
15-Apr-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
dcache: move cond_resched() into the end of __dentry_kill() cond_resched() in shrink_dentry_list() is too early Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3a8e3611 |
|
15-Apr-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
d_walk(): kill 'finish' callback no users left Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ff17fa56 |
|
15-Apr-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
d_invalidate(): unhash immediately Once that is done, we can just hunt mountpoints down one by one; no new mountpoints can be added from now on, so we don't need anything tricky in finish() callback, etc. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
32785c05 |
|
10-Apr-2018 |
Nikolay Borisov <nborisov@suse.com> |
fs/dcache.c: add cond_resched() in shrink_dentry_list() As previously reported (https://patchwork.kernel.org/patch/8642031/) it's possible to call shrink_dentry_list with a large number of dentries (> 10000). This, in turn, could trigger the softlockup detector and possibly trigger a panic. In addition to the unmount path being vulnerable to this scenario, at SuSE we've observed similar situation happening during process exit on processes that touch a lot of dentries. Here is an excerpt from a crash dump. The number after the colon are the number of dentries on the list passed to shrink_dentry_list: PID 99760: 10722 PID 107530: 215 PID 108809: 24134 PID 108877: 21331 PID 141708: 16487 So we want to kill between 15k-25k dentries without yielding. And one possible call stack looks like: 4 [ffff8839ece41db0] _raw_spin_lock at ffffffff8152a5f8 5 [ffff8839ece41db0] evict at ffffffff811c3026 6 [ffff8839ece41dd0] __dentry_kill at ffffffff811bf258 7 [ffff8839ece41df0] shrink_dentry_list at ffffffff811bf593 8 [ffff8839ece41e18] shrink_dcache_parent at ffffffff811bf830 9 [ffff8839ece41e50] proc_flush_task at ffffffff8120dd61 10 [ffff8839ece41ec0] release_task at ffffffff81059ebd 11 [ffff8839ece41f08] do_exit at ffffffff8105b8ce 12 [ffff8839ece41f78] sys_exit at ffffffff8105bd53 13 [ffff8839ece41f80] system_call_fastpath at ffffffff81532909 While some of the callers of shrink_dentry_list do use cond_resched, this is not sufficient to prevent softlockups. So just move cond_resched into shrink_dentry_list from its callers. David said: I've found hundreds of occurrences of warnings that we emit when need_resched stays set for a prolonged period of time with the stack trace that is included in the change log. Link: http://lkml.kernel.org/r/1521718946-31521-1-git-send-email-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f1782c9b |
|
10-Apr-2018 |
Roman Gushchin <guro@fb.com> |
dcache: account external names as indirectly reclaimable memory I received a report about suspicious growth of unreclaimable slabs on some machines. I've found that it happens on machines with low memory pressure, and these unreclaimable slabs are external names attached to dentries. External names are allocated using generic kmalloc() function, so they are accounted as unreclaimable. But they are held by dentries, which are reclaimable, and they will be reclaimed under the memory pressure. In particular, this breaks MemAvailable calculation, as it doesn't take unreclaimable slabs into account. This leads to a silly situation, when a machine is almost idle, has no memory pressure and therefore has a big dentry cache. And the resulting MemAvailable is too low to start a new workload. To address the issue, the NR_INDIRECTLY_RECLAIMABLE_BYTES counter is used to track the amount of memory, consumed by external names. The counter is increased in the dentry allocation path, if an external name structure is allocated; and it's decreased in the dentry freeing path. To reproduce the problem I've used the following Python script: import os for iter in range (0, 10000000): try: name = ("/some_long_name_%d" % iter) + "_" * 220 os.stat(name) except Exception: pass Without this patch: $ cat /proc/meminfo | grep MemAvailable MemAvailable: 7811688 kB $ python indirect.py $ cat /proc/meminfo | grep MemAvailable MemAvailable: 2753052 kB With the patch: $ cat /proc/meminfo | grep MemAvailable MemAvailable: 7809516 kB $ python indirect.py $ cat /proc/meminfo | grep MemAvailable MemAvailable: 7749144 kB [guro@fb.com: fix indirectly reclaimable memory accounting for CONFIG_SLOB] Link: http://lkml.kernel.org/r/20180312194140.19517-1-guro@fb.com [guro@fb.com: fix indirectly reclaimable memory accounting] Link: http://lkml.kernel.org/r/20180313125701.7955-1-guro@fb.com Link: http://lkml.kernel.org/r/20180305133743.12746-5-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
cbd4a5bc |
|
29-Mar-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
d_genocide: move export to definition Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
42177007 |
|
11-Mar-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
fold dentry_lock_for_move() into its sole caller and clean it up Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
076515fc |
|
10-Mar-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
make non-exchanging __d_move() copy ->d_parent rather than swap them Currently d_move(from, to) does the following: * name/parent of from <- old name/parent of to, from hashed there * to is unhashed * name of to is preserved * if from used to be detached, to gets detached * if from used to be attached, parent of to <- old parent of from. That's both user-visibly bogus and complicates reasoning a lot. Much saner semantics would be * name/parent of from <- name/parent of to, from hashed there. * to is unhashed * name/parent of to is unchanged. The price, of course, is that old parent of from might lose a reference. However, * all potentially cross-directory callers of d_move() have both parents pinned directly; typically, dentries themselves are grabbed only after we have grabbed and locked both parents. IOW, the decrement of old parent's refcount in case of d_move() won't reach zero. * __d_move() from d_splice_alias() is done to detached alias. No refcount decrements in that case * __d_move() from __d_unalias() *can* get the refcount to zero. So let's grab a reference to alias' old parent before calling __d_unalias() and dput() it after we'd dropped rename_lock. That does make d_splice_alias() potentially blocking. However, it has no callers in non-sleepable contexts (and the case where we'd grown that dget/dput pair is _very_ rare, so performance is not an issue). Another thing that needs adjustment is unlocking in the end of __d_move(); folded it in. And cleaned the remnants of bogus ordering from the "lock them in the beginning" counterpart - it's never been right and now (well, for 7 years now) we have that thing always serialized on rename_lock anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7a5cf791 |
|
05-Mar-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
split d_path() and friends into a separate file Those parts of fs/dcache.c are pretty much self-contained. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
43986d63 |
|
25-Feb-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
dcache.c: trim includes Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8f04da2a |
|
22-Feb-2018 |
John Ogness <john.ogness@linutronix.de> |
fs/dcache: Avoid a try_lock loop in shrink_dentry_list() shrink_dentry_list() holds dentry->d_lock and needs to acquire dentry->d_inode->i_lock. This cannot be done with a spin_lock() operation because it's the reverse of the regular lock order. To avoid ABBA deadlocks it is done with a trylock loop. Trylock loops are problematic in two scenarios: 1) PREEMPT_RT converts spinlocks to 'sleeping' spinlocks, which are preemptible. As a consequence the i_lock holder can be preempted by a higher priority task. If that task executes the trylock loop it will do so forever and live lock. 2) In virtual machines trylock loops are problematic as well. The VCPU on which the i_lock holder runs can be scheduled out and a task on a different VCPU can loop for a whole time slice. In the worst case this can lead to starvation. Commits 47be61845c77 ("fs/dcache.c: avoid soft-lockup in dput()") and 046b961b45f9 ("shrink_dentry_list(): take parent's d_lock earlier") are addressing exactly those symptoms. Avoid the trylock loop by using dentry_kill(). When pruning ancestors, the same code applies that is used to kill a dentry in dput(). This also has the benefit that the locking order is now the same. First the inode is locked, then the parent. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
f657a666 |
|
23-Feb-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
get rid of trylock loop around dentry_kill() In case when trylock in there fails, deal with it directly in dentry_kill(). Note that in cases when we drop and retake ->d_lock, we need to recheck whether to retain the dentry. Another thing is that dropping/retaking ->d_lock might have ended up with negative dentry turning into positive; that, of course, can happen only once... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
62d9956c |
|
06-Mar-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
handle move to LRU in retain_dentry() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a338579f |
|
23-Feb-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
dput(): consolidate the "do we need to retain it?" into an inlined helper Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8b987a46 |
|
23-Feb-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
split the slow part of lock_parent() off Turn the "trylock failed" part into uninlined __lock_parent(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
65d8eb5a |
|
23-Feb-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
now lock_parent() can't run into killed dentry all remaining callers hold either a reference or ->i_lock Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3b3f09f4 |
|
23-Feb-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
get rid of trylock loop in locking dentries on shrink list In case of trylock failure don't re-add to the list - drop the locks and carefully get them in the right order. For shrink_dentry_list(), somebody having grabbed a reference to dentry means that we can kick it off-list, so if we find dentry being modified under us we don't need to play silly buggers with retries anyway - off the list it is. The locking logics taken out into a helper of its own; lock_parent() is no longer used for dentries that can be killed under us. [fix from Eric Biggers folded] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c19457f0 |
|
23-Feb-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
d_delete(): get rid of trylock loop just grab ->i_lock first; we have a positive dentry, nothing's going to happen to inode Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c1d0c1a2 |
|
22-Feb-2018 |
John Ogness <john.ogness@linutronix.de> |
fs/dcache: Move dentry_kill() below lock_parent() A subsequent patch will modify dentry_kill() to call lock_parent(). Move the dentry_kill() implementation "as is" below lock_parent() first. This will help simplify the review of the subsequent patch with dentry_kill() changes. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
06080d10 |
|
22-Feb-2018 |
John Ogness <john.ogness@linutronix.de> |
fs/dcache: Remove stale comment from dentry_kill() Commit 0d98439ea3c6 ("vfs: use lockred "dead" flag to mark unrecoverably dead dentries") removed the `ref' parameter in dentry_kill() but its documentation remained. Remove it. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
0632a9ac |
|
06-Mar-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
take write_seqcount_invalidate() into __d_drop() ... and reorder it with making d_unhashed() true. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8cc07c80 |
|
19-Feb-2018 |
Will Deacon <will@kernel.org> |
fs: dcache: Use READ_ONCE when accessing i_dir_seq i_dir_seq is subject to concurrent modification by a cmpxchg or store-release operation, so ensure that the relaxed access in d_alloc_parallel uses READ_ONCE. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
015555fd |
|
19-Feb-2018 |
Will Deacon <will@kernel.org> |
fs: dcache: Avoid livelock between d_alloc_parallel and __d_add If d_alloc_parallel runs concurrently with __d_add, it is possible for d_alloc_parallel to continuously retry whilst i_dir_seq has been incremented to an odd value by __d_add: CPU0: __d_add n = start_dir_add(dir); cmpxchg(&dir->i_dir_seq, n, n + 1) == n CPU1: d_alloc_parallel retry: seq = smp_load_acquire(&parent->d_inode->i_dir_seq) & ~1; hlist_bl_lock(b); bit_spin_lock(0, (unsigned long *)b); // Always succeeds CPU0: __d_lookup_done(dentry) hlist_bl_lock bit_spin_lock(0, (unsigned long *)b); // Never succeeds CPU1: if (unlikely(parent->d_inode->i_dir_seq != seq)) { hlist_bl_unlock(b); goto retry; } Since the simple bit_spin_lock used to implement hlist_bl_lock does not provide any fairness guarantees, then CPU1 can starve CPU0 of the lock and prevent it from reaching end_dir_add(dir), therefore CPU1 cannot exit its retry loop because the sequence number always has the bottom bit set. This patch resolves the livelock by not taking hlist_bl_lock in d_alloc_parallel if the sequence counter is odd, since any subsequent masked comparison with i_dir_seq will fail anyway. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Reported-by: Naresh Madhusudana <naresh.madhusudana@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3b821409 |
|
23-Feb-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
lock_parent() needs to recheck if dentry got __dentry_kill'ed under it In case when dentry passed to lock_parent() is protected from freeing only by the fact that it's on a shrink list and trylock of parent fails, we could get hit by __dentry_kill() (and subsequent dentry_kill(parent)) between unlocking dentry and locking presumed parent. We need to recheck that dentry is alive once we lock both it and parent *and* postpone rcu_read_unlock() until after that point. Otherwise we could return a pointer to struct dentry that already is rcu-scheduled for freeing, with ->d_lock held on it; caller's subsequent attempt to unlock it can end up with memory corruption. Cc: stable@vger.kernel.org # 3.12+, counting backports Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
babcbbc7 |
|
01-Feb-2018 |
Andrey Ryabinin <ryabinin.a.a@gmail.com> |
fs: dcache: Revert "manually unpoison dname after allocation to shut up kasan's reports" This reverts commit df4c0e36f1b1782b0611a77c52cc240e5c4752dd. It's no longer needed since dentry_string_cmp() now uses read_word_at_a_time() to avoid kasan's reports. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bfe7aa6c |
|
01-Feb-2018 |
Andrey Ryabinin <ryabinin.a.a@gmail.com> |
fs/dcache: Use read_word_at_a_time() in dentry_string_cmp() dentry_string_cmp() performs the word-at-a-time reads from 'cs' and may read slightly more than it was requested in kmallac(). Normally this would make KASAN to report out-of-bounds access, but this was workarounded by commit df4c0e36f1b1 ("fs: dcache: manually unpoison dname after allocation to shut up kasan's reports"). This workaround is not perfect, since it allows out-of-bounds access to dentry's name for all the code, not just in dentry_string_cmp(). So it would be better to use read_word_at_a_time() instead and revert commit df4c0e36f1b1. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b35d786b |
|
20-Nov-2017 |
Alexey Dobriyan <adobriyan@gmail.com> |
dcache: delete unused d_hash_mask Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
854d3e63 |
|
20-Nov-2017 |
Alexey Dobriyan <adobriyan@gmail.com> |
dcache: subtract d_hash_shift from 32 in advance Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
f9c34674 |
|
19-Jan-2018 |
Miklos Szeredi <miklos@szeredi.hu> |
vfs: factor out helpers d_instantiate_anon() and d_alloc_anon() Those helpers are going to be used by overlayfs to implement NFS export decode. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
#
e8f9e5b7 |
|
11-Jan-2018 |
Amir Goldstein <amir73il@gmail.com> |
ovl: verify directory index entries on mount Directory index entries should have 'upper' xattr pointing to the real upper dir. Verifying that the upper dir file handle is not stale is expensive, so only verify stale directory index entries on mount if NFS export feature is enabled. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
#
6a9b8820 |
|
10-Jun-2017 |
David Windsor <dave@nullcore.net> |
vfs: Define usercopy region in names_cache slab caches VFS pathnames are stored in the names_cache slab cache, either inline or across an entire allocation entry (when approaching PATH_MAX). These are copied to/from userspace, so they must be entirely whitelisted. cache object allocation: include/linux/fs.h: #define __getname() kmem_cache_alloc(names_cachep, GFP_KERNEL) example usage trace: strncpy_from_user+0x4d/0x170 getname_flags+0x6f/0x1f0 user_path_at_empty+0x23/0x40 do_mount+0x69/0xda0 SyS_mount+0x83/0xd0 fs/namei.c: getname_flags(...): ... result = __getname(); ... kname = (char *)result->iname; result->name = kname; len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX); ... if (unlikely(len == EMBEDDED_NAME_MAX)) { const size_t size = offsetof(struct filename, iname[1]); kname = (char *)result; result = kzalloc(size, GFP_KERNEL); ... result->name = kname; len = strncpy_from_user(kname, filename, PATH_MAX); In support of usercopy hardening, this patch defines the entire cache object in the names_cache slab cache as whitelisted, since it may entirely hold name strings to be copied to/from userspace. This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log, add usage trace] Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
|
#
80344266 |
|
10-Jun-2017 |
David Windsor <dave@nullcore.net> |
dcache: Define usercopy region in dentry_cache slab cache When a dentry name is short enough, it can be stored directly in the dentry itself (instead in a separate kmalloc allocation). These dentry short names, stored in struct dentry.d_iname and therefore contained in the dentry_cache slab cache, need to be coped to userspace. cache object allocation: fs/dcache.c: __d_alloc(...): ... dentry = kmem_cache_alloc(dentry_cache, ...); ... dentry->d_name.name = dentry->d_iname; example usage trace: filldir+0xb0/0x140 dcache_readdir+0x82/0x170 iterate_dir+0x142/0x1b0 SyS_getdents+0xb5/0x160 fs/readdir.c: (called via ctx.actor by dir_emit) filldir(..., const char *name, ...): ... copy_to_user(..., name, namlen) fs/libfs.c: dcache_readdir(...): ... next = next_positive(dentry, p, 1) ... dir_emit(..., next->d_name.name, ...) In support of usercopy hardening, this patch defines a region in the dentry_cache slab cache in which userspace copy operations are allowed. This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamic copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust hunks for kmalloc-specific things moved later] [kees: adjust commit log, provide usage trace] Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
|
#
61647823 |
|
09-Nov-2017 |
NeilBrown <neilb@suse.com> |
VFS: close race between getcwd() and d_move() d_move() will call __d_drop() and then __d_rehash() on the dentry being moved. This creates a small window when the dentry appears to be unhashed. Many tests of d_unhashed() are made under ->d_lock and so are safe from racing with this window, but some aren't. In particular, getcwd() calls d_unlinked() (which calls d_unhashed()) without d_lock protection, so it can race. This races has been seen in practice with lustre, which uses d_move() as part of name lookup. See: https://jira.hpdd.intel.com/browse/LU-9735 It could race with a regular rename(), and result in ENOENT instead of either the 'before' or 'after' name. The race can be demonstrated with a simple program which has two threads, one renaming a directory back and forth while another calls getcwd() within that directory: it should never fail, but does. See: https://patchwork.kernel.org/patch/9455345/ We could fix this race by taking d_lock and rechecking when d_unhashed() reports true. Alternately when can remove the window, which is the approach this patch takes. ___d_drop() is introduce which does *not* clear d_hash.pprev so the dentry still appears to be hashed. __d_drop() calls ___d_drop(), then clears d_hash.pprev. __d_move() now uses ___d_drop() and only clears d_hash.pprev when not rehashing. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
f1ee6162 |
|
20-Dec-2017 |
NeilBrown <neilb@suse.com> |
VFS: don't keep disconnected dentries on d_anon The original purpose of the per-superblock d_anon list was to keep disconnected dentries in the cache between consecutive requests to the NFS server. Dentries can be disconnected if a client holds a file open and repeatedly performs IO on it, and if the server drops the dentry, whether due to memory pressure, server restart, or "echo 3 > /proc/sys/vm/drop_caches". This purpose was thwarted by commit 75a6f82a0d10 ("freeing unlinked file indefinitely delayed") which caused disconnected dentries to be freed as soon as their refcount reached zero. This means that, when a dentry being used by nfsd gets disconnected, a new one needs to be allocated for every request (unless requests overlap). As the dentry has no name, no parent, and no children, there is little of value to cache. As small memory allocations are typically fast (from per-cpu free lists) this likely has little cost. This means that the original purpose of s_anon is no longer relevant: there is no longer any need to keep disconnected dentries on a list so they appear to be hashed. However, s_anon now has a new use. When you mount an NFS filesystem, the dentry stored in s_root is just a placebo. The "real" root dentry is allocated using d_obtain_root() and so it kept on the s_anon list. I don't know the reason for this, but suspect it related to NFSv4 where a mount of "server:/some/path" require NFS to look up the root filehandle on the server, then walk down "/some" and "/path" to get the filehandle to mount. Whatever the reason, NFS depends on the s_anon list and on shrink_dcache_for_umount() pruning all dentries on this list. So we cannot simply remove s_anon. We could just leave the code unchanged, but apart from that being potentially confusing, the (unfair) bit-spin-lock which protects s_anon can become a bottle neck when lots of disconnected dentries are being created. So this patch renames s_anon to s_roots, and stops storing disconnected dentries on the list. Only dentries obtained with d_obtain_root() are now stored on this list. There are many fewer of these (only NFS and NILFS2 use the call, and only during filesystem mount) so contention on the bit-lock will not be a problem. Possibly an alternate solution should be found for NFS and NILFS2, but that would require understanding their needs first. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3382290e |
|
24-Oct-2017 |
Will Deacon <will@kernel.org> |
locking/barriers: Convert users of lockless_dereference() to READ_ONCE() [ Note, this is a Git cherry-pick of the following commit: 506458efaf15 ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()") ... for easier x86 PTI code testing and back-porting. ] READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it can be used instead of lockless_dereference() without any change in semantics. Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
9c565035 |
|
17-Nov-2017 |
Yang Shi <yang.s@alibaba-inc.com> |
vfs: remove unused hardirq.h Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by vfs at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi <yang.s@alibaba-inc.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7088efa9 |
|
09-Oct-2017 |
Paul E. McKenney <paulmck@kernel.org> |
fs/dcache: Use release-acquire for name/length update The code in __d_alloc() carefully orders filling in the NUL character of the name (and the length, hash, and the name itself) with assigning of the name itself. However, prepend_name() does not order the accesses to the ->name and ->len fields, other than on TSO systems. This commit therefore replaces prepend_name()'s READ_ONCE() of ->name with an smp_load_acquire(), which orders against the subsequent READ_ONCE() of ->len. Because READ_ONCE() now incorporates smp_read_barrier_depends(), prepend_name()'s smp_read_barrier_depends() is removed. Finally, to save a line, the smp_wmb()/store pair in __d_alloc() is replaced by smp_store_release(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: <linux-fsdevel@vger.kernel.org>
|
#
49502766 |
|
15-Nov-2017 |
Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com> |
kmemcheck: remove annotations Patch series "kmemcheck: kill kmemcheck", v2. As discussed at LSF/MM, kill kmemcheck. KASan is a replacement that is able to work without the limitation of kmemcheck (single CPU, slow). KASan is already upstream. We are also not aware of any users of kmemcheck (or users who don't consider KASan as a suitable replacement). The only objection was that since KASAN wasn't supported by all GCC versions provided by distros at that time we should hold off for 2 years, and try again. Now that 2 years have passed, and all distros provide gcc that supports KASAN, kill kmemcheck again for the very same reasons. This patch (of 4): Remove kmemcheck annotations, and calls to kmemcheck from the kernel. [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs] Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Alexander Potapenko <glider@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Hansen <devtimhansen@gmail.com> Cc: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
66702eb5 |
|
23-Oct-2017 |
Mark Rutland <mark.rutland@arm.com> |
locking/atomics, fs/dcache: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE() For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't currently harmful. However, for some features it is necessary to instrument reads and writes separately, which is not possible with ACCESS_ONCE(). This distinction is critical to correct operation. It's possible to transform the bulk of kernel code using the Coccinelle script below. However, this doesn't handle comments, leaving references to ACCESS_ONCE() instances which have been removed. As a preparatory step, this patch converts the dcache code and comments to use {READ,WRITE}_ONCE() consistently. ---- virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-4-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
506458ef |
|
24-Oct-2017 |
Will Deacon <will@kernel.org> |
locking/barriers: Convert users of lockless_dereference() to READ_ONCE() READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it can be used instead of lockless_dereference() without any change in semantics. Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
b17c070f |
|
10-Jul-2017 |
Sahitya Tummala <stummala@codeaurora.org> |
fs/dcache.c: fix spin lockup issue on nlru->lock __list_lru_walk_one() acquires nlru spin lock (nlru->lock) for longer duration if there are more number of items in the lru list. As per the current code, it can hold the spin lock for upto maximum UINT_MAX entries at a time. So if there are more number of items in the lru list, then "BUG: spinlock lockup suspected" is observed in the below path: spin_bug+0x90 do_raw_spin_lock+0xfc _raw_spin_lock+0x28 list_lru_add+0x28 dput+0x1c8 path_put+0x20 terminate_walk+0x3c path_lookupat+0x100 filename_lookup+0x6c user_path_at_empty+0x54 SyS_faccessat+0xd0 el0_svc_naked+0x24 This nlru->lock is acquired by another CPU in this path - d_lru_shrink_move+0x34 dentry_lru_isolate_shrink+0x48 __list_lru_walk_one.isra.10+0x94 list_lru_walk_node+0x40 shrink_dcache_sb+0x60 do_remount_sb+0xbc do_emergency_remount+0xb0 process_one_work+0x228 worker_thread+0x2e0 kthread+0xf4 ret_from_fork+0x10 Fix this lockup by reducing the number of entries to be shrinked from the lru list to 1024 at once. Also, add cond_resched() before processing the lru list again. Link: http://marc.info/?t=149722864900001&r=1&w=2 Link: http://lkml.kernel.org/r/1498707575-2472-1-git-send-email-stummala@codeaurora.org Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Alexander Polakov <apolyakov@beget.ru> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
49d31c2f |
|
07-Jul-2017 |
Al Viro <viro@zeniv.linux.org.uk> |
dentry name snapshots take_dentry_name_snapshot() takes a safe snapshot of dentry name; if the name is a short one, it gets copied into caller-supplied structure, otherwise an extra reference to external name is grabbed (those are never modified). In either case the pointer to stable string is stored into the same structure. dentry must be held by the caller of take_dentry_name_snapshot(), but may be freely dropped afterwards - the snapshot will stay until destroyed by release_dentry_name_snapshot(). Intended use: struct name_snapshot s; take_dentry_name_snapshot(&s, dentry); ... access s.name ... release_dentry_name_snapshot(&s); Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name to pass down with event. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3d375d78 |
|
06-Jul-2017 |
Pavel Tatashin <pasha.tatashin@oracle.com> |
mm: update callers to use HASH_ZERO flag Update dcache, inode, pid, mountpoint, and mount hash tables to use HASH_ZERO, and remove initialization after allocations. In case of places where HASH_EARLY was used such as in __pv_init_lock_hash the zeroed hash table was already assumed, because memblock zeroes the memory. CPU: SPARC M6, Memory: 7T Before fix: Dentry cache hash table entries: 1073741824 Inode-cache hash table entries: 536870912 Mount-cache hash table entries: 16777216 Mountpoint-cache hash table entries: 16777216 ftrace: allocating 20414 entries in 40 pages Total time: 11.798s After fix: Dentry cache hash table entries: 1073741824 Inode-cache hash table entries: 536870912 Mount-cache hash table entries: 16777216 Mountpoint-cache hash table entries: 16777216 ftrace: allocating 20414 entries in 40 pages Total time: 3.198s CPU: Intel Xeon E5-2630, Memory: 2.2T: Before fix: Dentry cache hash table entries: 536870912 Inode-cache hash table entries: 268435456 Mount-cache hash table entries: 8388608 Mountpoint-cache hash table entries: 8388608 CPU: Physical Processor ID: 0 Total time: 3.245s After fix: Dentry cache hash table entries: 536870912 Inode-cache hash table entries: 268435456 Mount-cache hash table entries: 8388608 Mountpoint-cache hash table entries: 8388608 CPU: Physical Processor ID: 0 Total time: 3.244s Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Babu Moger <babu.moger@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
cdf01226 |
|
04-Jul-2017 |
David Howells <dhowells@redhat.com> |
VFS: Provide empty name qstr Provide an empty name (ie. "") qstr for general use. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6916363f |
|
27-Jun-2017 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
fs/dcache: init in_lookup_hashtable in_lookup_hashtable was introduced in commit 94bdd655caba ("parallel lookups machinery, part 3") and never initialized but since it is in the data it is all zeros. But we need this for -RT. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
81be24d2 |
|
03-Jun-2017 |
Al Viro <viro@ZenIV.linux.org.uk> |
Hang/soft lockup in d_invalidate with simultaneous calls It's not hard to trigger a bunch of d_invalidate() on the same dentry in parallel. They end up fighting each other - any dentry picked for removal by one will be skipped by the rest and we'll go for the next iteration through the entire subtree, even if everything is being skipped. Morevoer, we immediately go back to scanning the subtree. The only thing we really need is to dissolve all mounts in the subtree and as soon as we've nothing left to do, we can just unhash the dentry and bugger off. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
563f4001 |
|
18-Apr-2017 |
Josef Bacik <josef@toxicpanda.com> |
fs: don't set *REFERENCED on single use objects By default we set DCACHE_REFERENCED and I_REFERENCED on any dentry or inode we create. This is problematic as this means that it takes two trips through the LRU for any of these objects to be reclaimed, regardless of their actual lifetime. With enough pressure from these caches we can easily evict our working set from page cache with single use objects. So instead only set *REFERENCED if we've already been added to the LRU list. This means that we've been touched since the first time we were accessed, and so more likely to need to hang out in cache. To illustrate this issue I wrote the following scripts https://github.com/josefbacik/debug-scripts/tree/master/cache-pressure on my test box. It is a single socket 4 core CPU with 16gib of RAM and I tested on an Intel 2tib NVME drive. The cache-pressure.sh script creates a new file system and creates 2 6.5gib files in order to take up 13gib of the 16gib of ram with pagecache. Then it runs a test program that reads these 2 files in a loop, and keeps track of how often it has to read bytes for each loop. On an ideal system with no pressure we should have to read 0 bytes indefinitely. The second thing this script does is start a fs_mark job that creates a ton of 0 length files, putting pressure on the system with slab only allocations. On exit the script prints out how many bytes were read by the read-file program. The results are as follows Without patch: /mnt/btrfs-test/reads/file1: total read during loops 27262988288 /mnt/btrfs-test/reads/file2: total read during loops 27262976000 With patch: /mnt/btrfs-test/reads/file2: total read during loops 18640457728 /mnt/btrfs-test/reads/file1: total read during loops 9565376512 This patch results in a 50% reduction of the amount of pages evicted from our working set. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3895dbf8 |
|
02-Jan-2017 |
Eric W. Biederman <ebiederm@xmission.com> |
mnt: Protect the mountpoint hashtable with mount_lock Protecting the mountpoint hashtable with namespace_sem was sufficient until a call to umount_mnt was added to mntput_no_expire. At which point it became possible for multiple calls of put_mountpoint on the same hash chain to happen on the same time. Kristen Johansen <kjlx@templeofstupid.com> reported: > This can cause a panic when simultaneous callers of put_mountpoint > attempt to free the same mountpoint. This occurs because some callers > hold the mount_hash_lock, while others hold the namespace lock. Some > even hold both. > > In this submitter's case, the panic manifested itself as a GP fault in > put_mountpoint() when it called hlist_del() and attempted to dereference > a m_hash.pprev that had been poisioned by another thread. Al Viro observed that the simple fix is to switch from using the namespace_sem to the mount_lock to protect the mountpoint hash table. I have taken Al's suggested patch moved put_mountpoint in pivot_root (instead of taking mount_lock an additional time), and have replaced new_mountpoint with get_mountpoint a function that does the hash table lookup and addition under the mount_lock. The introduction of get_mounptoint ensures that only the mount_lock is needed to manipulate the mountpoint hashtable. d_set_mounted is modified to only set DCACHE_MOUNTED if it is not already set. This allows get_mountpoint to use the setting of DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry happens exactly once. Cc: stable@vger.kernel.org Fixes: ce07d891a089 ("mnt: Honor MNT_LOCKED when detaching mounts") Reported-by: Krister Johansen <kjlx@templeofstupid.com> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
#
7c0f6ba6 |
|
24-Dec-2016 |
Linus Torvalds <torvalds@linux-foundation.org> |
Replace <asm/uaccess.h> with <linux/uaccess.h> globally This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f74e7b33 |
|
23-Nov-2016 |
Ian Kent <ikent@redhat.com> |
vfs: remove unused have_submounts() function Now that path_has_submounts() has been added have_submounts() is no longer used so remove it. Link: http://lkml.kernel.org/r/20161011053428.27645.12310.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Omar Sandoval <osandov@osandov.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
01619491 |
|
23-Nov-2016 |
Ian Kent <ikent@redhat.com> |
vfs: add path_has_submounts() d_mountpoint() can only be used reliably to establish if a dentry is not mounted in any namespace. It isn't aware of the possibility there may be multiple mounts using the given dentry, possibly in a different namespace. Add function, path_has_submounts(), that checks is a struct path contains mounts (or is a mountpoint itself) to handle this case. Link: http://lkml.kernel.org/r/20161011053403.27645.55242.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Omar Sandoval <osandov@osandov.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6fa67e70 |
|
31-Jul-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
get rid of 'parent' argument of ->d_compare() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
15d3c589 |
|
29-Jul-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
fold _d_rehash() and __d_rehash() together The only place where we feed to __d_rehash() something other than d_hash(dentry->d_name.hash) is __d_move(), where we give it d_hash of another dentry. Postpone rehashing until we'd switched the names and we are rid of that exception, along with the need to keep _d_rehash() and __d_rehash() separate. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
d614146d |
|
28-Jul-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
fold dentry_rcuwalk_invalidate() into its only remaining caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
47be6184 |
|
05-Jul-2016 |
Wei Fang <fangwei1@huawei.com> |
fs/dcache.c: avoid soft-lockup in dput() We triggered soft-lockup under stress test which open/access/write/close one file concurrently on more than five different CPUs: WARN: soft lockup - CPU#0 stuck for 11s! [who:30631] ... [<ffffffc0003986f8>] dput+0x100/0x298 [<ffffffc00038c2dc>] terminate_walk+0x4c/0x60 [<ffffffc00038f56c>] path_lookupat+0x5cc/0x7a8 [<ffffffc00038f780>] filename_lookup+0x38/0xf0 [<ffffffc000391180>] user_path_at_empty+0x78/0xd0 [<ffffffc0003911f4>] user_path_at+0x1c/0x28 [<ffffffc00037d4fc>] SyS_faccessat+0xb4/0x230 ->d_lock trylock may failed many times because of concurrently operations, and dput() may execute a long time. Fix this by replacing cpu_relax() with cond_resched(). dput() used to be sleepable, so make it sleepable again should be safe. Cc: <stable@vger.kernel.org> Signed-off-by: Wei Fang <fangwei1@huawei.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
285b102d |
|
28-Jun-2016 |
Miklos Szeredi <mszeredi@redhat.com> |
vfs: new d_init method Allow filesystem to initialize dentry at allocation time. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9aba36de |
|
20-Jul-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
qstr constify instances in fs/dcache.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
d4c91a8f |
|
25-Jun-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
new helper: d_same_name() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ae0a843c |
|
26-Mar-2016 |
He Kuang <hekuang@huawei.com> |
dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends() lockless_dereference() was added which can be used in place of hard-coding smp_read_barrier_depends(). Signed-off-by: He Kuang <hekuang@huawei.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
2d902671 |
|
30-Jun-2016 |
Miklos Szeredi <mszeredi@redhat.com> |
vfs: merge .d_select_inode() into .d_real() The two methods essentially do the same: find the real dentry/inode belonging to an overlay dentry. The difference is in the usage: vfs_open() uses ->d_select_inode() and expects the function to perform copy-up if necessary based on the open flags argument. file_dentry() uses ->d_real() passing in the overlay dentry as well as the underlying inode. vfs_rename() uses ->d_select_inode() but passes zero flags. ->d_real() with a zero inode would have worked just as well here. This patch merges the functionality of ->d_select_inode() into ->d_real() by adding an 'open_flags' argument to the latter. [Al Viro] Make the signature of d_real() match that of ->d_real() again. And constify the inode argument, while we are at it. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
#
e7d6ef97 |
|
19-Jun-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
fix idiotic braino in d_alloc_parallel() Check for d_unhashed() while searching in in-lookup hash was absolutely wrong. Worse, it masked a deadlock on dget() done under bitlock that nests inside ->d_lock. Thanks to J. R. Okajima for spotting it. Spotted-by: "J. R. Okajima" <hooanon05g@gmail.com> Wearing-brown-paperbag: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
703b5faf |
|
09-Jun-2016 |
George Spelvin <linux@sciencehorizons.net> |
fs/dcache.c: Save one 32-bit multiply in dcache lookup Noe that we're mixing in the parent pointer earlier, we don't need to use hash_32() to mix its bits. Instead, we can just take the msbits of the hash value directly. For those applications which use the partial_name_hash(), move the multiply to end_name_hash. Signed-off-by: George Spelvin <linux@sciencehorizons.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8387ff25 |
|
10-Jun-2016 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: make the string hashes salt the hash We always mixed in the parent pointer into the dentry name hash, but we did it late at lookup time. It turns out that we can simplify that lookup-time action by salting the hash with the parent pointer early instead of late. A few other users of our string hashes also wanted to mix in their own pointers into the hash, and those are updated to use the same mechanism. Hash users that don't have any particular initial salt can just use the NULL pointer as a no-salt. Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ba65dc5e |
|
10-Jun-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
much milder d_walk() race d_walk() relies upon the tree not getting rearranged under it without rename_lock being touched. And we do grab rename_lock around the places that change the tree topology. Unfortunately, branch reordering is just as bad from d_walk() POV and we have two places that do it without touching rename_lock - one in handling of cursors (for ramfs-style directories) and another in autofs. autofs one is a separate story; this commit deals with the cursors. * mark cursor dentries explicitly at allocation time * make __dentry_kill() leave ->d_child.next pointing to the next non-cursor sibling, making sure that it won't be moved around unnoticed before the parent is relocked on ascend-to-parent path in d_walk(). * make d_walk() skip cursors explicitly; strictly speaking it's not necessary (all callbacks we pass to d_walk() are no-ops on cursors), but it makes analysis easier. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3d56c25e |
|
07-Jun-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
fix d_walk()/non-delayed __d_free() race Ascend-to-parent logics in d_walk() depends on all encountered child dentries not getting freed without an RCU delay. Unfortunately, in quite a few cases it is not true, with hard-to-hit oopsable race as the result. Fortunately, the fix is simiple; right now the rule is "if it ever been hashed, freeing must be delayed" and changing it to "if it ever had a parent, freeing must be delayed" closes that hole and covers all cases the old rule used to cover. Moreover, pipes and sockets remain _not_ covered, so we do not introduce RCU delay in the cases which are the reason for having that delay conditional in the first place. Cc: stable@vger.kernel.org # v3.2+ (and watch out for __d_materialise_dentry()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
550dce01 |
|
29-May-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
unify dentry_iput() and dentry_unlink_inode() There is a lot of duplication between dentry_unlink_inode() and dentry_iput(). The only real difference is that dentry_unlink_inode() bumps ->d_seq and dentry_iput() doesn't. The argument of the latter is known to have been unhashed, so anybody who might've found it in RCU lookup would already be doomed to a ->d_seq mismatch. And we want to avoid pointless smp_rmb() there. This patch makes dentry_unlink_inode() bump ->d_seq only for hashed dentries. It's safe (d_delete() calls that sucker only if we are holding the only reference to dentry, so rehash is not going to happen) and it allows to use dentry_unlink_inode() in __dentry_kill() and get rid of dentry_iput(). The interesting question here is profiling; it *is* a hot path, and extra conditional jumps in there might or might not be painful. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
affda484 |
|
29-May-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
trim fsnotify hooks a bit fsnotify_d_move()/__fsnotify_d_instantiate()/__fsnotify_update_dcache_flags() are identical to each other, regardless of the config. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
fcfd2fbf |
|
20-May-2016 |
George Spelvin <linux@sciencehorizons.net> |
fs/namei.c: Add hashlen_string() function We'd like to make more use of the highly-optimized dcache hash functions throughout the kernel, rather than have every subsystem create its own, and a function that hashes basic null-terminated strings is required for that. (The name is to emphasize that it returns both hash and length.) It's actually useful in the dcache itself, specifically d_alloc_name(). Other uses in the next patch. full_name_hash() is also tweaked to make it more generally useful: 1) Take a "char *" rather than "unsigned char *" argument, to be consistent with hash_name(). 2) Handle zero-length inputs. If we want more callers, we don't want to make them worry about corner cases. Signed-off-by: George Spelvin <linux@sciencehorizons.net>
|
#
9902af79 |
|
15-Apr-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
parallel lookups: actual switch to rwsem ta-da! The main issue is the lack of down_write_killable(), so the places like readdir.c switched to plain inode_lock(); once killable variants of rwsem primitives appear, that'll be dealt with. lockdep side also might need more work Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
d9171b93 |
|
15-Apr-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
parallel lookups machinery, part 4 (and last) If we *do* run into an in-lookup match, we need to wait for it to cease being in-lookup. Fortunately, we do have unused space in in-lookup dentries - d_lru is never looked at until it stops being in-lookup. So we can stash a pointer to wait_queue_head from stack frame of the caller of ->lookup(). Some precautions are needed while waiting, but it's not that hard - we do hold a reference to dentry we are waiting for, so it can't go away. If it's found to be in-lookup the wait_queue_head is still alive and will remain so at least while ->d_lock is held. Moreover, the condition we are waiting for becomes true at the same point where everything on that wq gets woken up, so we can just add ourselves to the queue once. d_alloc_parallel() gets a pointer to wait_queue_head_t from its caller; lookup_slow() adjusted, d_add_ci() taught to use d_alloc_parallel() if the dentry passed to it happens to be in-lookup one (i.e. if it's been called from the parallel lookup). That's pretty much it - all that remains is to switch ->i_mutex to rwsem and have lookup_slow() take it shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
94bdd655 |
|
15-Apr-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
parallel lookups machinery, part 3 We will need to be able to check if there is an in-lookup dentry with matching parent/name. Right now it's impossible, but as soon as start locking directories shared such beasts will appear. Add a secondary hash for locating those. Hash chains go through the same space where d_alias will be once it's not in-lookup anymore. Search is done under the same bitlock we use for modifications - with the primary hash we can rely on d_rehash() into the wrong chain being the worst that could happen, but here the pointers are buggered once it's removed from the chain. On the other hand, the chains are not going to be long and normally we'll end up adding to the chain anyway. That allows us to avoid bothering with ->d_lock when doing the comparisons - everything is stable until removed from chain. New helper: d_alloc_parallel(). Right now it allocates, verifies that no hashed and in-lookup matches exist and adds to in-lookup hash. Returns ERR_PTR() for error, hashed match (in the unlikely case it's been found) or new dentry. In-lookup matches trigger BUG() for now; that will change in the next commit when we introduce waiting for ongoing lookup to finish. Note that in-lookup matches won't be possible until we actually go for shared locking. lookup_slow() switched to use of d_alloc_parallel(). Again, these commits are separated only for making it easier to review. All this machinery will start doing something useful only when we go for shared locking; it's just that the combination is too large for my taste. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
84e710da |
|
14-Apr-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
parallel lookups machinery, part 2 We'll need to verify that there's neither a hashed nor in-lookup dentry with desired parent/name before adding to in-lookup set. One possible solution would be to hold the parent's ->d_lock through both checks, but while the in-lookup set is relatively small at any time, dcache is not. And holding the parent's ->d_lock through something like __d_lookup_rcu() would suck too badly. So we leave the parent's ->d_lock alone, which means that we watch out for the following scenario: * we verify that there's no hashed match * existing in-lookup match gets hashed by another process * we verify that there's no in-lookup matches and decide that everything's fine. Solution: per-directory kinda-sorta seqlock, bumped around the times we hash something that used to be in-lookup or move (and hash) something in place of in-lookup. Then the above would turn into * read the counter * do dcache lookup * if no matches found, check for in-lookup matches * if there had been none of those either, check if the counter has changed; repeat if it has. The "kinda-sorta" part is due to the fact that we don't have much spare space in inode. There is a spare word (shared with i_bdev/i_cdev/i_pipe), so the counter part is not a problem, but spinlock is a different story. We could use the parent's ->d_lock, and it would be less painful in terms of contention, for __d_add() it would be rather inconvenient to grab; we could do that (using lock_parent()), but... Fortunately, we can get serialization on the counter itself, and it might be a good idea in general; we can use cmpxchg() in a loop to get from even to odd and smp_store_release() from odd to even. This commit adds the counter and updating logics; the readers will be added in the next commit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
85c7f810 |
|
14-Apr-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
beginning of transition to parallel lookups - marking in-lookup dentries marked as such when (would be) parallel lookup is about to pass them to actual ->lookup(); unmarked when * __d_add() is about to make it hashed, positive or not. * __d_move() (from d_splice_alias(), directly or via __d_unalias()) puts a preexisting dentry in its place * in caller of ->lookup() if it has escaped all of the above. Bug (WARN_ON, actually) if it reaches the final dput() or d_instantiate() while still marked such. As the result, we are guaranteed that for as long as the flag is set, dentry will * remain negative unhashed with positive refcount * never have its ->d_alias looked at * never have its ->d_lru looked at * never have its ->d_parent and ->d_name changed Right now we have at most one such for any given parent directory. With parallel lookups that restriction will weaken to * only exist when parent is locked shared * at most one with given (parent,name) pair (comparison of names is according to ->d_compare()) * only exist when there's no hashed dentry with the same (parent,name) Transition will take the next several commits; unfortunately, we'll only be able to switch to rwsem at the end of this series. The reason for not making it a single patch is to simplify review. New primitives: d_in_lookup() (a predicate checking if dentry is in the in-lookup state) and d_lookup_done() (tells the system that we are done with lookup and if it's still marked as in-lookup, it should cease to be such). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
0568d705 |
|
14-Apr-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
__d_add(): don't drop/regain ->d_lock Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b9680917 |
|
10-Apr-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
security_d_instantiate(): move to the point prior to attaching dentry to inode Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
798434bd |
|
24-Mar-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
__d_alloc(): treat NULL name as QSTR("/", 1) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
d101a125 |
|
26-Mar-2016 |
Miklos Szeredi <miklos@szeredi.hu> |
fs: add file_dentry() This series fixes bugs in nfs and ext4 due to 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay"). Regular files opened on overlayfs will result in the file being opened on the underlying filesystem, while f_path points to the overlayfs mount/dentry. This confuses filesystems which get the dentry from struct file and assume it's theirs. Add a new helper, file_dentry() [*], to get the filesystem's own dentry from the file. This checks file->f_path.dentry->d_flags against DCACHE_OP_REAL, and returns file->f_path.dentry if DCACHE_OP_REAL is not set (this is the common, non-overlayfs case). In the uncommon case it will call into overlayfs's ->d_real() to get the underlying dentry, matching file_inode(file). The reason we need to check against the inode is that if the file is copied up while being open, d_real() would return the upper dentry, while the open file comes from the lower dentry. [*] If possible, it's better simply to use file_inode() instead. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: <stable@vger.kernel.org> # v4.2 Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Daniel Axtens <dja@axtens.net>
|
#
ed782b5a |
|
09-Mar-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
dcache.c: new helper: __d_add() d_add() with inode->i_lock already held; common to d_add() and d_splice_alias(). All ->lookup() instances that end up hashing the dentry they are given will hash it here. This almost completes the preparations to parallel lookups proper - the only remaining bit is taking security_d_instantiate() past d_rehash() and doing rehashing without dropping ->d_lock. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
de689f5e |
|
09-Mar-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
don't bother with __d_instantiate(dentry, NULL) it's a no-op - bumping ->d_seq is pointless there. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
27f203f6 |
|
09-Mar-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
untangle fsnotify_d_instantiate() a bit First of all, don't bother calling it if inode is NULL - that makes inode argument unused. Moreover, do it *before* dropping ->d_lock, not right after that (and don't bother grabbing ->d_lock in it, of course). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
34d0d19d |
|
08-Mar-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
uninline d_add() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
668d0cd5 |
|
07-Mar-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
replace d_add_unique() with saner primitive new primitive: d_exact_alias(dentry, inode). If there is an unhashed dentry with the same name/parent and given inode, rehash, grab and return it. Otherwise, return NULL. The only caller of d_add_unique() switched to d_exact_alias() + d_splice_alias(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a528aca7 |
|
28-Feb-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
use ->d_seq to get coherency between ->d_inode and ->d_flags Games with ordering and barriers are way too brittle. Just bump ->d_seq before and after updating ->d_inode and ->d_flags type bits, so that verifying ->d_seq would guarantee they are coherent. Cc: stable@vger.kernel.org # v3.13+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
5955102c |
|
22-Jan-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
wrappers for ->i_mutex access parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
5d097056 |
|
14-Jan-2016 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
kmemcg: account certain kmem allocations to memcg Mark those kmem allocations that are known to be easily triggered from userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to memcg. For the list, see below: - threadinfo - task_struct - task_delay_info - pid - cred - mm_struct - vm_area_struct and vm_region (nommu) - anon_vma and anon_vma_chain - signal_struct - sighand_struct - fs_struct - files_struct - fdtable and fdtable->full_fds_bits - dentry and external_name - inode for all filesystems. This is the most tedious part, because most filesystems overwrite the alloc_inode method. The list is far from complete, so feel free to add more objects. Nevertheless, it should be close to "account everything" approach and keep most workloads within bounds. Malevolent users will be able to breach the limit, but this was possible even with the former "account everything" approach (simply because it did not account everything in fact). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6b255391 |
|
17-Nov-2015 |
Al Viro <viro@zeniv.linux.org.uk> |
replace ->follow_link() with new method that could stay in RCU mode new method: ->get_link(); replacement of ->follow_link(). The differences are: * inode and dentry are passed separately * might be called both in RCU and non-RCU mode; the former is indicated by passing it a NULL dentry. * when called that way it isn't allowed to block and should return ERR_PTR(-ECHILD) if it needs to be called in non-RCU mode. It's a flagday change - the old method is gone, all in-tree instances converted. Conversion isn't hard; said that, so far very few instances do not immediately bail out when called in RCU mode. That'll change in the next commits. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a6e5787f |
|
16-Nov-2015 |
Yaowei Bai <baiyaowei@cmss.chinamobile.com> |
fs/dcache.c: is_subdir can be boolean This patch makes is_subdir return bool to improve readability due to this particular function only using either one or zero as its return value. No functional change. Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a03e283b |
|
15-Aug-2015 |
Eric W. Biederman <ebiederm@xmission.com> |
dcache: Reduce the scope of i_lock in d_splice_alias i_lock is only needed until __d_find_any_alias calls dget on the alias dentry. After that the reference to new ensures that dentry_kill and d_delete will not remove the inode from the dentry, and remove the dentry from the inode->d_entry list. The inode i_lock came to be held over the the __d_move calls in d_splice_alias through a series of introduction of locks with increasing smaller scope. First it was the dcache_lock, then it was the dcache_inode_lock, and finally inode->i_lock. Furthermore inode->i_lock is not held over any other calls to d_move or __d_move so it can not provide any meaningful rename protection. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
cde93be4 |
|
15-Aug-2015 |
Eric W. Biederman <ebiederm@xmission.com> |
dcache: Handle escaped paths in prepend_path A rename can result in a dentry that by walking up d_parent will never reach it's mnt_root. For lack of a better term I call this an escaped path. prepend_path is called by four different functions __d_path, d_absolute_path, d_path, and getcwd. __d_path only wants to see paths are connected to the root it passes in. So __d_path needs prepend_path to return an error. d_absolute_path similarly wants to see paths that are connected to some root. Escaped paths are not connected to any mnt_root so d_absolute_path needs prepend_path to return an error greater than 1. So escaped paths will be treated like paths on lazily unmounted mounts. getcwd needs to prepend "(unreachable)" so getcwd also needs prepend_path to return an error. d_path is the interesting hold out. d_path just wants to print something, and does not care about the weird cases. Which raises the question what should be printed? Given that <escaped_path>/<anything> should result in -ENOENT I believe it is desirable for escaped paths to be printed as empty paths. As there are not really any meaninful path components when considered from the perspective of a mount tree. So tweak prepend_path to return an empty path with an new error code of 3 when it encounters an escaped path. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4248b0da |
|
06-Aug-2015 |
Mel Gorman <mgorman@suse.de> |
fs, file table: reinit files_stat.max_files after deferred memory initialisation Dave Hansen reported the following; My laptop has been behaving strangely with 4.2-rc2. Once I log in to my X session, I start getting all kinds of strange errors from applications and see this in my dmesg: VFS: file-max limit 8192 reached The problem is that the file-max is calculated before memory is fully initialised and miscalculates how much memory the kernel is using. This patch recalculates file-max after deferred memory initialisation. Note that using memory hotplug infrastructure would not have avoided this problem as the value is not recalculated after memory hot-add. 4.1: files_stat.max_files = 6582781 4.2-rc2: files_stat.max_files = 8192 4.2-rc2 patched: files_stat.max_files = 6562467 Small differences with the patch applied and 4.1 but not enough to matter. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Dave Hansen <dave.hansen@intel.com> Cc: Nicolai Stange <nicstange@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Alex Ng <alexng@microsoft.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
75a6f82a |
|
07-Jul-2015 |
Al Viro <viro@ZenIV.linux.org.uk> |
freeing unlinked file indefinitely delayed Normally opening a file, unlinking it and then closing will have the inode freed upon close() (provided that it's not otherwise busy and has no remaining links, of course). However, there's one case where that does *not* happen. Namely, if you open it by fhandle with cold dcache, then unlink() and close(). In normal case you get d_delete() in unlink(2) notice that dentry is busy and unhash it; on the final dput() it will be forcibly evicted from dcache, triggering iput() and inode removal. In this case, though, we end up with *two* dentries - disconnected (created by open-by-fhandle) and regular one (used by unlink()). The latter will have its reference to inode dropped just fine, but the former will not - it's considered hashed (it is on the ->s_anon list), so it will stay around until the memory pressure will finally do it in. As the result, we have the final iput() delayed indefinitely. It's trivial to reproduce - void flush_dcache(void) { system("mount -o remount,rw /"); } static char buf[20 * 1024 * 1024]; main() { int fd; union { struct file_handle f; char buf[MAX_HANDLE_SZ]; } x; int m; x.f.handle_bytes = sizeof(x); chdir("/root"); mkdir("foo", 0700); fd = open("foo/bar", O_CREAT | O_RDWR, 0600); close(fd); name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0); flush_dcache(); fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR); unlink("foo/bar"); write(fd, buf, sizeof(buf)); system("df ."); /* 20Mb eaten */ close(fd); system("df ."); /* should've freed those 20Mb */ flush_dcache(); system("df ."); /* should be the same as #2 */ } will spit out something like Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 303843 1131 100% / Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 303843 1131 100% / Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 283282 21692 93% / - inode gets freed only when dentry is finally evicted (here we trigger than by remount; normally it would've happened in response to memory pressure hell knows when). Cc: stable@vger.kernel.org # v2.6.38+; earlier ones need s/kill_it/unhash_it/ Acked-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
93e3bce6 |
|
24-May-2015 |
Eric W. Biederman <ebiederm@xmission.com> |
vfs: Remove incorrect debugging WARN in prepend_path The warning message in prepend_path is unclear and outdated. It was added as a warning that the mechanism for generating names of pseudo files had been removed from prepend_path and d_dname should be used instead. Unfortunately the warning reads like a general warning, making it unclear what to do with it. Remove the warning. The transition it was added to warn about is long over, and I added code several years ago which in rare cases causes the warning to fire on legitimate code, and the warning is now firing and scaring people for no good reason. Cc: stable@vger.kernel.org Reported-by: Ivan Delalande <colona@arista.com> Reported-by: Omar Sandoval <osandov@osandov.com> Fixes: f48cfddc6729e ("vfs: In d_path don't call d_dname on a mount point") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
#
4bacc9c9 |
|
18-Jun-2015 |
David Howells <dhowells@redhat.com> |
overlayfs: Make f_path always point to the overlay and f_inode to the underlay Make file->f_path always point to the overlay dentry so that the path in /proc/pid/fd is correct and to ensure that label-based LSMs have access to the overlay as well as the underlay (path-based LSMs probably don't need it). Using my union testsuite to set things up, before the patch I see: [root@andromeda union-testsuite]# bash 5</mnt/a/foo107 [root@andromeda union-testsuite]# ls -l /proc/$$/fd/ ... lr-x------. 1 root root 64 Jun 5 14:38 5 -> /a/foo107 [root@andromeda union-testsuite]# stat /mnt/a/foo107 ... Device: 23h/35d Inode: 13381 Links: 1 ... [root@andromeda union-testsuite]# stat -L /proc/$$/fd/5 ... Device: 23h/35d Inode: 13381 Links: 1 ... After the patch: [root@andromeda union-testsuite]# bash 5</mnt/a/foo107 [root@andromeda union-testsuite]# ls -l /proc/$$/fd/ ... lr-x------. 1 root root 64 Jun 5 14:22 5 -> /mnt/a/foo107 [root@andromeda union-testsuite]# stat /mnt/a/foo107 ... Device: 23h/35d Inode: 40346 Links: 1 ... [root@andromeda union-testsuite]# stat -L /proc/$$/fd/5 ... Device: 23h/35d Inode: 40346 Links: 1 ... Note the change in where /proc/$$/fd/5 points to in the ls command. It was pointing to /a/foo107 (which doesn't exist) and now points to /mnt/a/foo107 (which is correct). The inode accessed, however, is the lower layer. The union layer is on device 25h/37d and the upper layer on 24h/36d. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a7c6f571 |
|
11-Jun-2015 |
Peter Zijlstra <peterz@infradead.org> |
seqcount: Rename write_seqcount_barrier() I'll shortly be introducing another seqcount primitive that's useful to provide ordering semantics and would like to use the write_seqcount_barrier() name for that. Seeing how there's only one user of the current primitive, lets rename it to invalidate, as that appears what its doing. While there, employ lockdep_assert_held() instead of assert_spin_locked() to not generate debug code for regular kernels. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: ktkhai@parallels.com Cc: rostedt@goodmis.org Cc: juri.lelli@gmail.com Cc: pang.xunlei@linaro.org Cc: Oleg Nesterov <oleg@redhat.com> Cc: wanpeng.li@linux.intel.com Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20150611124743.279926217@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
#
2159184e |
|
28-May-2015 |
Al Viro <viro@zeniv.linux.org.uk> |
d_walk() might skip too much when we find that a child has died while we'd been trying to ascend, we should go into the first live sibling itself, rather than its sibling. Off-by-one in question had been introduced in "deal with deadlock in d_walk()" and the fix needs to be backported to all branches this one has been backported to. Cc: stable@vger.kernel.org # 3.2 and later Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4bf46a27 |
|
05-Mar-2015 |
David Howells <dhowells@redhat.com> |
VFS: Impose ordering on accesses of d_inode and d_flags Impose ordering on accesses of d_inode and d_flags to avoid the need to do this: if (!dentry->d_inode || d_is_negative(dentry)) { when this: if (d_is_negative(dentry)) { should suffice. This check is especially problematic if a dentry can have its type field set to something other than DENTRY_MISS_TYPE when d_inode is NULL (as in unionmount). What we really need to do is stick a write barrier between setting d_inode and setting d_flags and a read barrier between reading d_flags and reading d_inode. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3d330dc1 |
|
10-Feb-2015 |
J. Bruce Fields <bfields@redhat.com> |
dcache: return -ESTALE not -EBUSY on distributed fs race On a distributed filesystem it's possible for lookup to discover that a directory it just found is already cached elsewhere in the directory heirarchy. The dcache won't let us keep the directory in both places, so we have to move the dentry to the new location from the place we previously had it cached. If the parent has changed, then this requires all the same locks as we'd need to do a cross-directory rename. But we're already in lookup holding one parent's i_mutex, so it's too late to acquire those locks in the right order. The (unreliable) solution in __d_unalias is to trylock() the required locks and return -EBUSY if it fails. I see no particular reason for returning -EBUSY, and -ESTALE is already the result of some other lookup races on NFS. I think -ESTALE is the more helpful error return. It also allows us to take advantage of the logic Jeff Layton added in c6a9428401c0 "vfs: fix renameat to retry on ESTALE errors" and ancestors, which hopefully resolves some of these errors before they're returned to userspace. I can reproduce these cases using NFS with: ssh root@$client ' mount -olookupcache=pos '$server':'$export' /mnt/ mkdir /mnt/TO mkdir /mnt/DIR touch /mnt/DIR/test.txt while true; do strace -e open cat /mnt/DIR/test.txt 2>&1 | grep EBUSY done ' ssh root@$server ' while true; do mv $export/DIR $export/TO/DIR mv $export/TO/DIR $export/DIR done ' It also helps to add some other concurrent use of the directory on the client (e.g., "ls /mnt/TO"). And you can replace the server-side mv's by client-side mv's that are repeatedly killed. (If the client is interrupted while waiting for the RENAME response then it's left with a dentry that has to go under one parent or the other, but it doesn't yet know which.) Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
44bdb5e5 |
|
28-Jan-2015 |
David Howells <dhowells@redhat.com> |
VFS: Split DCACHE_FILE_TYPE into regular and special types Split DCACHE_FILE_TYPE into DCACHE_REGULAR_TYPE (dentries representing regular files) and DCACHE_SPECIAL_TYPE (representing blockdev, chardev, FIFO and socket files). d_is_reg() and d_is_special() are added to detect these subtypes and d_is_file() is left as the union of the two. This allows a number of places that use S_ISREG(dentry->d_inode->i_mode) to use d_is_reg(dentry) instead. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
df1a085a |
|
28-Jan-2015 |
David Howells <dhowells@redhat.com> |
VFS: Add a fallthrough flag for marking virtual dentries Add a DCACHE_FALLTHRU flag to indicate that, in a layered filesystem, this is a virtual dentry that covers another one in a lower layer that should be used instead. This may be recorded on medium if directory integration is stored there. The flag can be set with d_set_fallthru() and tested with d_is_fallthru(). Original-author: Valerie Aurora <vaurora@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
df4c0e36 |
|
13-Feb-2015 |
Andrey Ryabinin <ryabinin.a.a@gmail.com> |
fs: dcache: manually unpoison dname after allocation to shut up kasan's reports We need to manually unpoison rounded up allocation size for dname to avoid kasan's reports in dentry_string_cmp(). When CONFIG_DCACHE_WORD_ACCESS=y dentry_string_cmp may access few bytes beyound requested in kmalloc() size. dentry_string_cmp() relates on that fact that dentry allocated using kmalloc and kmalloc internally round up allocation size. So this is not a bug, but this makes kasan to complain about such accesses. To avoid such reports we mark rounded up allocation size in shadow as accessible. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3f97b163 |
|
12-Feb-2015 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
list_lru: add helpers to isolate items Currently, the isolate callback passed to the list_lru_walk family of functions is supposed to just delete an item from the list upon returning LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by __list_lru_walk_one after the callback returns. Since the callback is allowed to drop the lock after removing an item (it has to return LRU_REMOVED_RETRY then), the nr_items can be less than the actual number of elements on the list even if we check them under the lock. This makes it difficult to move items from one list_lru_one to another, which is required for per-memcg list_lru reparenting - we can't just splice the lists, we have to move entries one by one. This patch therefore introduces helpers that must be used by callback functions to isolate items instead of raw list_del/list_move. These are list_lru_isolate and list_lru_isolate_move. They not only remove the entry from the list, but also fix the nr_items counter, making sure nr_items always reflects the actual number of elements on the list if checked under the appropriate lock. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
503c358c |
|
12-Feb-2015 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
list_lru: introduce list_lru_shrink_{count,walk} Kmem accounting of memcg is unusable now, because it lacks slab shrinker support. That means when we hit the limit we will get ENOMEM w/o any chance to recover. What we should do then is to call shrink_slab, which would reclaim old inode/dentry caches from this cgroup. This is what this patch set is intended to do. Basically, it does two things. First, it introduces the notion of per-memcg slab shrinker. A shrinker that wants to reclaim objects per cgroup should mark itself as SHRINKER_MEMCG_AWARE. Then it will be passed the memory cgroup to scan from in shrink_control->memcg. For such shrinkers shrink_slab iterates over the whole cgroup subtree under the target cgroup and calls the shrinker for each kmem-active memory cgroup. Secondly, this patch set makes the list_lru structure per-memcg. It's done transparently to list_lru users - everything they have to do is to tell list_lru_init that they want memcg-aware list_lru. Then the list_lru will automatically distribute objects among per-memcg lists basing on which cgroup the object is accounted to. This way to make FS shrinkers (icache, dcache) memcg-aware we only need to make them use memcg-aware list_lru, and this is what this patch set does. As before, this patch set only enables per-memcg kmem reclaim when the pressure goes from memory.limit, not from memory.kmem.limit. Handling memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and it is still unclear whether we will have this knob in the unified hierarchy. This patch (of 9): NUMA aware slab shrinkers use the list_lru structure to distribute objects coming from different NUMA nodes to different lists. Whenever such a shrinker needs to count or scan objects from a particular node, it issues commands like this: count = list_lru_count_node(lru, sc->nid); freed = list_lru_walk_node(lru, sc->nid, isolate_func, isolate_arg, &sc->nr_to_scan); where sc is an instance of the shrink_control structure passed to it from vmscan. To simplify this, let's add special list_lru functions to be used by shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which consolidate the nid and nr_to_scan arguments in the shrink_control structure. This will also allow us to avoid patching shrinkers that use list_lru when we make shrink_slab() per-memcg - all we will have to do is extend the shrink_control structure to include the target memcg and make list_lru_shrink_{count,walk} handle this appropriately. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Suggested-by: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Glauber Costa <glommer@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
360f5479 |
|
09-Jan-2015 |
Linus Torvalds <torvalds@linux-foundation.org> |
dcache: let the dentry count go down to zero without taking d_lock We can be more aggressive about this, if we are clever and careful. This is subtle. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
d6cb125b |
|
24-Dec-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
kill d_validate() no users left Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4a7795d3 |
|
19-Nov-2014 |
Yan, Zheng <zyan@redhat.com> |
vfs: fix reference leak in d_prune_aliases() In "d_prune_alias(): just lock the parent and call __dentry_kill()" the old dget + d_drop + dput has been replaced with lock_parent + __dentry_kill; unfortunately, dput() does more than just killing dentry - it also drops the reference to parent. New variant leaks that reference and needs dput(parent) after killing the child off. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
08d4f772 |
|
04-Sep-2014 |
Mikulas Patocka <mpatocka@redhat.com> |
dcache: fix kmemcheck warning in switch_names This patch fixes kmemcheck warning in switch_names. The function switch_names swaps inline names of two dentries. It swaps full arrays d_iname, no matter how many bytes are really used by the strings. Reading data beyond string ends results in kmemcheck warning. We fix the bug by marking both arrays as fully initialized. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v3.15 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b5ae6b15 |
|
12-Oct-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
merge d_materialise_unique() into d_splice_alias() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
427c77d4 |
|
12-Oct-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
d_add_ci() should just accept a hashed exact match if it finds one Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ca5358ef |
|
26-Oct-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
deal with deadlock in d_walk() ... by not hitting rename_retry for reasons other than rename having happened. In other words, do _not_ restart when finding that between unlocking the child and locking the parent the former got into __dentry_kill(). Skip the killed siblings instead... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
946e51f2 |
|
26-Oct-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
move d_rcu from overlapping d_child to overlapping d_alias Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
51486b900 |
|
23-Oct-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
fix inode leaks on d_splice_alias() failure exits d_splice_alias() callers expect it to either stash the inode reference into a new alias, or drop the inode reference. That makes it possible to just return d_splice_alias() result from ->lookup() instance, without any extra housekeeping required. Unfortunately, that should include the failure exits. If d_splice_alias() returns an error, it leaves the dentry it has been given negative and thus it *must* drop the inode reference. Easily fixed, but it goes way back and will need backporting. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
810bb172 |
|
11-Oct-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
take dname_external() into fs/dcache.c never used outside and it's too low-level for legitimate uses outside of fs/dcache.c anyway Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b8314f93 |
|
10-Aug-2014 |
Daeseok Youn <daeseok.youn@gmail.com> |
dcache: Fix no spaces at the start of a line in dcache.c Fixed coding style in dcache.c Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
29266201 |
|
30-May-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
dcache.c: call ->d_prune() regardless of d_unhashed() the only in-tree instance checks d_unhashed() anyway, out-of-tree code can preserve the current behaviour by adding such check if they want it and we get an ability to use it in cases where we *want* to be notified of killing being inevitable before ->d_lock is dropped, whether it's unhashed or not. In particular, autofs would benefit from that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
29355c39 |
|
30-May-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
d_prune_alias(): just lock the parent and call __dentry_kill() The only reason for games with ->d_prune() was __d_drop(), which was needed only to force dput() into killing the sucker off. Note that lock_parent() can be called under ->i_lock and won't drop it, so dentry is safe from somebody managing to kill it under us - it won't happen while we are holding ->i_lock. __dentry_kill() is called only with ->d_lockref.count being 0 (here and when picked from shrink list) or 1 (dput() and dropping the ancestors in shrink_dentry_list()), so it will never be called twice - the first thing it's doing is making ->d_lockref.count negative and once that happens, nothing will increment it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
5542aa2f |
|
13-Feb-2014 |
Eric W. Biederman <ebiederm@xmission.com> |
vfs: Make d_invalidate return void Now that d_invalidate can no longer fail, stop returning a useless return code. For the few callers that checked the return code update remove the handling of d_invalidate failure. Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1ffe46d1 |
|
13-Feb-2014 |
Eric W. Biederman <ebiederm@xmission.com> |
vfs: Merge check_submounts_and_drop and d_invalidate Now that d_invalidate is the only caller of check_submounts_and_drop, expand check_submounts_and_drop inline in d_invalidate. Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8ed936b5 |
|
01-Oct-2013 |
Eric W. Biederman <ebiederman@twitter.com> |
vfs: Lazily remove mounts on unlinked files and directories. With the introduction of mount namespaces and bind mounts it became possible to access files and directories that on some paths are mount points but are not mount points on other paths. It is very confusing when rm -rf somedir returns -EBUSY simply because somedir is mounted somewhere else. With the addition of user namespaces allowing unprivileged mounts this condition has gone from annoying to allowing a DOS attack on other users in the system. The possibility for mischief is removed by updating the vfs to support rename, unlink and rmdir on a dentry that is a mountpoint and by lazily unmounting mountpoints on deleted dentries. In particular this change allows rename, unlink and rmdir system calls on a dentry without a mountpoint in the current mount namespace to succeed, and it allows rename, unlink, and rmdir performed on a distributed filesystem to update the vfs cache even if when there is a mount in some namespace on the original dentry. There are two common patterns of maintaining mounts: Mounts on trusted paths with the parent directory of the mount point and all ancestory directories up to / owned by root and modifiable only by root (i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu, cpuacct, ...}, /usr, /usr/local). Mounts on unprivileged directories maintained by fusermount. In the case of mounts in trusted directories owned by root and modifiable only by root the current parent directory permissions are sufficient to ensure a mount point on a trusted path is not removed or renamed by anyone other than root, even if there is a context where the there are no mount points to prevent this. In the case of mounts in directories owned by less privileged users races with users modifying the path of a mount point are already a danger. fusermount already uses a combination of chdir, /proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races. The removable of global rename, unlink, and rmdir protection really adds nothing new to consider only a widening of the attack window, and fusermount is already safe against unprivileged users modifying the directory simultaneously. In principle for perfect userspace programs returning -EBUSY for unlink, rmdir, and rename of dentires that have mounts in the local namespace is actually unnecessary. Unfortunately not all userspace programs are perfect so retaining -EBUSY for unlink, rmdir and rename of dentries that have mounts in the current mount namespace plays an important role of maintaining consistency with historical behavior and making imperfect userspace applications hard to exploit. v2: Remove spurious old_dentry. v3: Optimized shrink_submounts_and_drop Removed unsued afs label v4: Simplified the changes to check_submounts_and_drop Do not rename check_submounts_and_drop shrink_submounts_and_drop Document what why we need atomicity in check_submounts_and_drop Rely on the parent inode mutex to make d_revalidate and d_invalidate an atomic unit. v5: Refcount the mountpoint to detach in case of simultaneous renames. Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
bafc9b75 |
|
13-Feb-2014 |
Eric W. Biederman <ebiederm@xmission.com> |
vfs: More precise tests in d_invalidate The current comments in d_invalidate about what and why it is doing what it is doing are wildly off-base. Which is not surprising as the comments date back to last minute bug fix of the 2.2 kernel. The big fat lie of a comment said: If it's a directory, we can't drop it for fear of somebody re-populating it with children (even though dropping it would make it unreachable from that root, we still might repopulate it if it was a working directory or similar). [AV] What we really need to avoid is multiple dentry aliases of the same directory inode; on all filesystems that have ->d_revalidate() we either declare all positive dentries always valid (and thus never fed to d_invalidate()) or use d_materialise_unique() and/or d_splice_alias(), which take care of alias prevention. The current rules are: - To prevent mount point leaks dentries that are mount points or that have childrent that are mount points may not be be unhashed. - All dentries may be unhashed. - Directories may be rehashed with d_materialise_unique check_submounts_and_drop implements this already for well maintained remote filesystems so implement the current rules in d_invalidate by just calling check_submounts_and_drop. The one difference between d_invalidate and check_submounts_and_drop is that d_invalidate must respect it when a d_revalidate method has earlier called d_drop so preserve the d_unhashed check in d_invalidate. Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3ccb354d |
|
12-Feb-2014 |
Eric W. Biederman <ebiederm@xmission.com> |
vfs: Document the effect of d_revalidate on d_find_alias d_drop or check_submounts_and_drop called from d_revalidate can result in renamed directories with child dentries being unhashed. These renamed and drop directory dentries can be rehashed after d_materialise_unique uses d_find_alias to find them. Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8d85b484 |
|
29-Sep-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
Allow sharing external names after __d_move() * external dentry names get a small structure prepended to them (struct external_name). * it contains an atomic refcount, matching the number of struct dentry instances that have ->d_name.name pointing to that external name. The first thing free_dentry() does is decrementing refcount of external name, so the instances that are between the call of free_dentry() and RCU-delayed actual freeing do not contribute. * __d_move(x, y, false) makes the name of x equal to the name of y, external or not. If y has an external name, extra reference is grabbed and put into x->d_name.name. If x used to have an external name, the reference to the old name is dropped and, should it reach zero, freeing is scheduled via kfree_rcu(). * free_dentry() in dentry with external name decrements the refcount of that name and, should it reach zero, does RCU-delayed call that will free both the dentry and external name. Otherwise it does what it used to do, except that __d_free() doesn't even look at ->d_name.name; it simply frees the dentry. All non-RCU accesses to dentry external name are safe wrt freeing since they all should happen before free_dentry() is called. RCU accesses might run into a dentry seen by free_dentry() or into an old name that got already dropped by __d_move(); however, in both cases dentry must have been alive and refer to that name at some point after we'd done rcu_read_lock(), which means that any freeing must be still pending. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6d13f694 |
|
29-Sep-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
missing data dependency barrier in prepend_name() AFAICS, prepend_name() is broken on SMP alpha. Disclaimer: I don't have SMP alpha boxen to reproduce it on. However, it really looks like the race is real. CPU1: d_path() on /mnt/ramfs/<255-character>/foo CPU2: mv /mnt/ramfs/<255-character> /mnt/ramfs/<63-character> CPU2 does d_alloc(), which allocates an external name, stores the name there including terminating NUL, does smp_wmb() and stores its address in dentry->d_name.name. It proceeds to d_add(dentry, NULL) and d_move() old dentry over to that. ->d_name.name value ends up in that dentry. In the meanwhile, CPU1 gets to prepend_name() for that dentry. It fetches ->d_name.name and ->d_name.len; the former ends up pointing to new name (64-byte kmalloc'ed array), the latter - 255 (length of the old name). Nothing to force the ordering there, and normally that would be OK, since we'd run into the terminating NUL and stop. Except that it's alpha, and we'd need a data dependency barrier to guarantee that we see that store of NUL __d_alloc() has done. In a similar situation dentry_cmp() would survive; it does explicit smp_read_barrier_depends() after fetching ->d_name.name. prepend_name() doesn't and it risks walking past the end of kmalloc'ed object and possibly oops due to taking a page fault in kernel mode. Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
d2fa4a84 |
|
24-Sep-2014 |
Mikhail Efremov <sem@altlinux.org> |
vfs: Don't exchange "short" filenames unconditionally. Only exchange source and destination filenames if flags contain RENAME_EXCHANGE. In case if executable file was running and replaced by other file /proc/PID/exe should still show correct file name, not the old name of the file by which it was replaced. The scenario when this bug manifests itself was like this: * ALT Linux uses rpm and start-stop-daemon; * during a package upgrade rpm creates a temporary file for an executable to rename it upon successful unpacking; * start-stop-daemon is run subsequently and it obtains the (nonexistant) temporary filename via /proc/PID/exe thus failing to identify the running process. Note that "long" filenames (> DNAiME_INLINE_LEN) are still exchanged without RENAME_EXCHANGE and this behaviour exists long enough (should be fixed too apparently). So this patch is just an interim workaround that restores behavior for "short" names as it was before changes introduced by commit da1ce0670c14 ("vfs: add cross-rename"). See https://lkml.org/lkml/2014/9/7/6 for details. AV: the comments about being more careful with ->d_name.hash than with ->d_name.name are from back in 2.3.40s; they became obsolete by 2.3.60s, when we started to unhash the target instead of swapping hash chain positions followed by d_delete() as we used to do when dcache was first introduced. Acked-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: da1ce0670c14 "vfs: add cross-rename" Signed-off-by: Mikhail Efremov <sem@altlinux.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a28ddb87 |
|
24-Sep-2014 |
Linus Torvalds <torvalds@linux-foundation.org> |
fold swapping ->d_name.hash into switch_names() and do it along with ->d_name.len there Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
986c0194 |
|
26-Sep-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
fold unlocking the children into dentry_unlock_parents_for_move() ... renaming it into dentry_unlock_for_move() and making it more symmetric with dentry_lock_for_move(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
63cf427a |
|
26-Sep-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
kill __d_materialise_dentry() it folds into __d_move() now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4453641f |
|
26-Sep-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
__d_materialise_dentry(): flip the order of arguments ... thus making it much closer to (now unreachable, BTW) IS_ROOT(dentry) case in __d_move(). A bit more and it'll fold in. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9d8cd306 |
|
26-Sep-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
__d_move(): fold manipulations with ->d_child/->d_subdirs list_del() + list_add() is a slightly pessimised list_move() list_del() + INIT_LIST_HEAD() is a slightly pessimised list_del_init() Interleaving those makes the resulting code even worse. And harder to follow... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8527dd71 |
|
26-Sep-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
don't open-code d_rehash() in d_materialise_unique() ... and get rid of duplicate BUG_ON() there Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
5cc3821b |
|
26-Sep-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
pull rehashing and unlocking the target dentry into __d_materialise_dentry() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6f18493e |
|
11-Sep-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon) and lock the right list there Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
99d263d4 |
|
13-Sep-2014 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: fix bad hashing of dentries Josef Bacik found a performance regression between 3.2 and 3.10 and narrowed it down to commit bfcfaa77bdf0 ("vfs: use 'unsigned long' accesses for dcache name comparison and hashing"). He reports: "The test case is essentially for (i = 0; i < 1000000; i++) mkdir("a$i"); On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k dir/sec with 3.10. This is because we spend waaaaay more time in __d_lookup on 3.10 than in 3.2. The new hashing function for strings is suboptimal for < sizeof(unsigned long) string names (and hell even > sizeof(unsigned long) string names that I've tested). I broke out the old hashing function and the new one into a userspace helper to get real numbers and this is what I'm getting: Old hash table had 1000000 entries, 0 dupes, 0 max dupes New hash table had 12628 entries, 987372 dupes, 900 max dupes We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash My test does the hash, and then does the d_hash into a integer pointer array the same size as the dentry hash table on my system, and then just increments the value at the address we got to see how many entries we overlap with. As you can see the old hash function ended up with all 1 million entries in their own bucket, whereas the new one they are only distributed among ~12.5k buckets, which is why we're using so much more CPU in __d_lookup". The reason for this hash regression is two-fold: - On 64-bit architectures the down-mixing of the original 64-bit word-at-a-time hash into the final 32-bit hash value is very simplistic and suboptimal, and just adds the two 32-bit parts together. In particular, because there is no bit shuffling and the mixing boundary is also a byte boundary, similar character patterns in the low and high word easily end up just canceling each other out. - the old byte-at-a-time hash mixed each byte into the final hash as it hashed the path component name, resulting in the low bits of the hash generally being a good source of hash data. That is not true for the word-at-a-time case, and the hash data is distributed among all the bits. The fix is the same in both cases: do a better job of mixing the bits up and using as much of the hash data as possible. We already have the "hash_32|64()" functions to do that. Reported-by: Josef Bacik <jbacik@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
49c7dd28 |
|
31-Jul-2014 |
Fengguang Wu <fengguang.wu@intel.com> |
fs: mark __d_obtain_alias static Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
95ad5c29 |
|
11-Mar-2014 |
J. Bruce Fields <bfields@redhat.com> |
dcache: d_splice_alias should detect loops I believe this can only happen in the case of a corrupted filesystem. So -EIO looks like the appropriate error. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8d80d7da |
|
16-Jan-2014 |
J. Bruce Fields <bfields@redhat.com> |
dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED If we get to this point and discover the dentry is not a root dentry, or not DCACHE_DISCONNECTED--great, we always prefer that anyway. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
52ed46f0 |
|
16-Jan-2014 |
J. Bruce Fields <bfields@redhat.com> |
dcache: remove unused d_find_alias parameter Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1a0a397e |
|
14-Feb-2014 |
J. Bruce Fields <bfields@redhat.com> |
dcache: d_obtain_alias callers don't all want DISCONNECTED There are a few d_obtain_alias callers that are using it to get the root of a filesystem which may already have an alias somewhere else. This is not the same as the filehandle-lookup case, and none of them actually need DCACHE_DISCONNECTED set. It isn't really a serious problem, but it would really be clearer if we reserved DCACHE_DISCONNECTED for those cases where it's actually needed. In the btrfs case this was causing a spurious printk from nfsd/nfsfh.c:fh_verify when it found an unexpected DCACHE_DISCONNECTED dentry. Josef worked around this by unsetting DCACHE_DISCONNECTED manually in 3a0dfa6a12e "Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol", and this replaces that workaround. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
da093a9b |
|
17-Feb-2014 |
J. Bruce Fields <bfields@redhat.com> |
dcache: d_splice_alias should ignore DCACHE_DISCONNECTED Any IS_ROOT() alias should be safe to use; there's nothing special about DCACHE_DISCONNECTED dentries. Note that this is in fact useful for filesystems such as btrfs which can legimately encounter a directory with a preexisting IS_ROOT alias on a lookup that crosses into a subvolume. (Those aliases are currently marked DCACHE_DISCONNECTED--but not really for any good reason, and we'll change that soon.) Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
908790fa |
|
17-Feb-2014 |
J. Bruce Fields <bfields@redhat.com> |
dcache: d_splice_alias mustn't create directory aliases Currently if d_splice_alias finds a directory with an alias that is not IS_ROOT or not DCACHE_DISCONNECTED, it creates a duplicate directory. Duplicate directory dentries are unacceptable; it is better just to error out. (In the case of a local filesystem the most likely case is filesystem corruption: for example, perhaps two directories point to the same child directory, and the other parent has already been found and cached.) Note that distributed filesystems may encounter this case in normal operation if a remote host moves a directory to a location different from the one we last cached in the dcache. For that reason, such filesystems should instead use d_materialise_unique, which tries to move the old directory alias to the right place instead of erroring out. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
75a2352d |
|
17-Feb-2014 |
J. Bruce Fields <bfields@redhat.com> |
dcache: close d_move race in d_splice_alias d_splice_alias will d_move an IS_ROOT() directory dentry into place if one exists. This should be safe as long as the dentry remains IS_ROOT, but I can't see what guarantees that: once we drop the i_lock all we hold here is the i_mutex on an unrelated parent directory. Instead copy the logic of d_materialise_unique. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3f70bd51 |
|
18-Feb-2014 |
J. Bruce Fields <bfields@redhat.com> |
dcache: move d_splice_alias Just a trivial move to locate it near (similar) d_materialise_unique code and save some forward references in a following patch. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c2338f2d |
|
11-Jun-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
lock_parent: don't step on stale ->d_parent of all-but-freed one Dentry that had been through (or into) __dentry_kill() might be seen by shrink_dentry_list(); that's normal, it'll be taken off the shrink list and freed if __dentry_kill() has already finished. The problem is, its ->d_parent might be pointing to already freed dentry, so lock_parent() needs to be careful. We need to check that dentry hasn't already gone into __dentry_kill() *and* grab rcu_read_lock() before dropping ->d_lock - the latter makes sure that whatever we see in ->d_parent after dropping ->d_lock it won't be freed until we drop rcu_read_lock(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1f7e0616 |
|
06-Jun-2014 |
Joe Perches <joe@perches.com> |
fs: convert use of typedef ctl_table to struct ctl_table This typedef is unnecessary and should just be removed. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9f12600f |
|
31-May-2014 |
Linus Torvalds <torvalds@linux-foundation.org> |
dcache: add missing lockdep annotation lock_parent() very much on purpose does nested locking of dentries, and is careful to maintain the right order (lock parent first). But because it didn't annotate the nested locking order, lockdep thought it might be a deadlock on d_lock, and complained. Add the proper annotation for the inner locking of the child dentry to make lockdep happy. Introduced by commit 046b961b45f9 ("shrink_dentry_list(): take parent's ->d_lock earlier"). Reported-and-tested-by: Josh Boyer <jwboyer@fedoraproject.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8cbf74da |
|
29-May-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
dentry_kill() doesn't need the second argument now it's 1 in the only remaining caller. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b2b80195 |
|
29-May-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
dealing with the rest of shrink_dentry_list() livelock We have the same problem with ->d_lock order in the inner loop, where we are dropping references to ancestors. Same solution, basically - instead of using dentry_kill() we use lock_parent() (introduced in the previous commit) to get that lock in a safe way, recheck ->d_count (in case if lock_parent() has ended up dropping and retaking ->d_lock and somebody managed to grab a reference during that window), trylock the inode->i_lock and use __dentry_kill() to do the rest. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
046b961b |
|
29-May-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
shrink_dentry_list(): take parent's ->d_lock earlier The cause of livelocks there is that we are taking ->d_lock on dentry and its parent in the wrong order, forcing us to use trylock on the parent's one. d_walk() takes them in the right order, and unfortunately it's not hard to create a situation when shrink_dentry_list() can't make progress since trylock keeps failing, and shrink_dcache_parent() or check_submounts_and_drop() keeps calling d_walk() disrupting the very shrink_dentry_list() it's waiting for. Solution is straightforward - if that trylock fails, let's unlock the dentry itself and take locks in the right order. We need to stabilize ->d_parent without holding ->d_lock, but that's doable using RCU. And we'd better do that in the very beginning of the loop in shrink_dentry_list(), since the checks on refcount, etc. would need to be redone anyway. That deals with a half of the problem - killing dentries on the shrink list itself. Another one (dropping their parents) is in the next commit. locking parent is interesting - it would be easy to do rcu_read_lock(), lock whatever we think is a parent, lock dentry itself and check if the parent is still the right one. Except that we need to check that *before* locking the dentry, or we are risking taking ->d_lock out of order. Fortunately, once the D1 is locked, we can check if D2->d_parent is equal to D1 without the need to lock D2; D2->d_parent can start or stop pointing to D1 only under D1->d_lock, so taking D1->d_lock is enough. In other words, the right solution is rcu_read_lock/lock what looks like parent right now/check if it's still our parent/rcu_read_unlock/lock the child. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ff2fde99 |
|
28-May-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
expand dentry_kill(dentry, 0) in shrink_dentry_list() Result will be massaged to saner shape in the next commits. It is ugly, no questions - the point of that one is to be a provably equivalent transformation (and it might be worth splitting a bit more). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e55fd011 |
|
28-May-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
split dentry_kill() ... into trylocks and everything else. The latter (actual killing) is __dentry_kill(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
64fd72e0 |
|
28-May-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
lift the "already marked killed" case into shrink_dentry_list() It can happen only when dentry_kill() is called with unlock_on_failure equal to 0 - other callers had dentry pinned until the moment they've got ->d_lock and DCACHE_DENTRY_KILLED is set only after lockref_mark_dead(). IOW, only one of three call sites of dentry_kill() might end up reaching that code. Just move it there. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
60942f2f |
|
02-May-2014 |
Miklos Szeredi <mszeredi@suse.cz> |
dcache: don't need rcu in shrink_dentry_list() Since now the shrink list is private and nobody can free the dentry while it is on the shrink list, we can remove RCU protection from this. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9c8c10e2 |
|
02-May-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
more graceful recovery in umount_collect() Start with shrink_dcache_parent(), then scan what remains. First of all, BUG() is very much an overkill here; we are holding ->s_umount, and hitting BUG() means that a lot of interesting stuff will be hanging after that point (sync(2), for example). Moreover, in cases when there had been more than one leak, we'll be better off reporting all of them. And more than just the last component of pathname - %pd is there for just such uses... That was the last user of dentry_lru_del(), so kill it off... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
fe91522a |
|
02-May-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
don't remove from shrink list in select_collect() If we find something already on a shrink list, just increment data->found and do nothing else. Loops in shrink_dcache_parent() and check_submounts_and_drop() will do the right thing - everything we did put into our list will be evicted and if there had been nothing, but data->found got non-zero, well, we have somebody else shrinking those guys; just try again. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
41edf278 |
|
01-May-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
dentry_kill(): don't try to remove from shrink list If the victim in on the shrink list, don't remove it from there. If shrink_dentry_list() manages to remove it from the list before we are done - fine, we'll just free it as usual. If not - mark it with new flag (DCACHE_MAY_FREE) and leave it there. Eventually, shrink_dentry_list() will get to it, remove the sucker from shrink list and call dentry_kill(dentry, 0). Which is where we'll deal with freeing. Since now dentry_kill(dentry, 0) may happen after or during dentry_kill(dentry, 1), we need to recognize that (by seeing DCACHE_DENTRY_KILLED already set), unlock everything and either free the sucker (in case DCACHE_MAY_FREE has been set) or leave it for ongoing dentry_kill(dentry, 1) to deal with. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
01b60351 |
|
29-Apr-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
expand the call of dentry_lru_del() in dentry_kill() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b4f0354e |
|
29-Apr-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
new helper: dentry_free() The part of old d_free() that dealt with actual freeing of dentry. Taken out of dentry_kill() into a separate function. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
5c47e6d0 |
|
29-Apr-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
fold try_prune_one_dentry() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
03b3b889 |
|
29-Apr-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
fold d_kill() and d_free() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
22213318 |
|
18-Apr-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
fix races between __d_instantiate() and checks of dentry flags in non-lazy walk we need to be careful about dentry switching from negative to positive - both ->d_flags and ->d_inode are updated, and in some places we might see only one store. The cases where dentry has been obtained by dcache lookup with ->i_mutex held on parent are safe - ->d_lock and ->i_mutex provide all the barriers we need. However, there are several places where we run into trouble: * do_last() fetches ->d_inode, then checks ->d_flags and assumes that inode won't be NULL unless d_is_negative() is true. Race with e.g. creat() - we might have fetched the old value of ->d_inode (still NULL) and new value of ->d_flags (already not DCACHE_MISS_TYPE). Lin Ming has observed and reported the resulting oops. * a bunch of places checks ->d_inode for being non-NULL, then checks ->d_flags for "is it a symlink". Race with symlink(2) in case if our CPU sees ->d_inode update first - we see non-NULL there, but ->d_flags still contains DCACHE_MISS_TYPE instead of DCACHE_SYMLINK_TYPE. Result: false negative on "should we follow link here?", with subsequent unpleasantness. Cc: stable@vger.kernel.org # 3.13 and 3.14 need that one Reported-and-tested-by: Lin Ming <minggr@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
da1ce067 |
|
01-Apr-2014 |
Miklos Szeredi <mszeredi@suse.cz> |
vfs: add cross-rename If flags contain RENAME_EXCHANGE then exchange source and destination files. There's no restriction on the type of the files; e.g. a directory can be exchanged with a symlink. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: J. Bruce Fields <bfields@redhat.com>
|
#
e825196d |
|
22-Mar-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
make prepend_name() work correctly when called with negative *buflen In all callchains leading to prepend_name(), the value left in *buflen is eventually discarded unused if prepend_name() has returned a negative. So we are free to do what prepend() does, and subtract from *buflen *before* checking for underflow (which turns into checking the sign of subtraction result, of course). Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
31bbe16f |
|
03-Jan-2014 |
David Herrmann <dh.herrmann@gmail.com> |
drm: add pseudo filesystem for shared inodes Our current DRM design uses a single address_space for all users of the same DRM device. However, there is no way to create an anonymous address_space without an underlying inode. Therefore, we wait for the first ->open() callback on a registered char-dev and take-over the inode of the char-dev. This worked well so far, but has several drawbacks: - We screw with FS internals and rely on some non-obvious invariants like inode->i_mapping being the same as inode->i_data for char-devs. - We don't have any address_space prior to the first ->open() from user-space. This leads to ugly fallback code and we cannot allocate global objects early. As pointed out by Al-Viro, fs/anon_inode.c is *not* supposed to be used by drivers for anonymous inode-allocation. Therefore, this patch follows the proposed alternative solution and adds a pseudo filesystem mount-point to DRM. We can then allocate private inodes including a private address_space for each DRM device at initialization time. Note that we could use: sysfs_get_inode(sysfs_mnt->mnt_sb, drm_device->dev->kobj.sd); to get access to the underlying sysfs-inode of a "struct device" object. However, most of this information is currently hidden and it's not clear whether this address_space is suitable for driver access. Thus, unless linux allows anonymous address_space objects or driver-core provides a public inode per device, we're left with our own private internal mount point. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
|
#
f6500801 |
|
25-Jan-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
__dentry_path() fixes * we need to save the starting point for restarts * reject pathologically short buffers outright Spotted-by: Denys Vlasenko <dvlasenk@redhat.com> Spotted-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a8323da0 |
|
20-Jan-2014 |
Eric W. Biederman <ebiederm@xmission.com> |
vfs: Remove second variable named error in __dentry_path In commit 232d2d60aa5469bb097f55728f65146bd49c1d25 Author: Waiman Long <Waiman.Long@hp.com> Date: Mon Sep 9 12:18:13 2013 -0400 dcache: Translating dentry into pathname without taking rename_lock The __dentry_path locking was changed and the variable error was intended to be moved outside of the loop. Unfortunately the inner declaration of error was not removed. Resulting in a version of __dentry_path that will never return an error. Remove the problematic inner declaration of error and allow __dentry_path to return errors once again. Cc: stable@vger.kernel.org Cc: Waiman Long <Waiman.Long@hp.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a5c21dce |
|
12-Dec-2013 |
Will Deacon <will@kernel.org> |
dcache: allow word-at-a-time name hashing with big-endian CPUs When explicitly hashing the end of a string with the word-at-a-time interface, we have to be careful which end of the word we pick up. On big-endian CPUs, the upper-bits will contain the data we're after, so ensure we generate our masks accordingly (and avoid hashing whatever random junk may have been sitting after the string). This patch adds a new dcache helper, bytemask_from_count, which creates a mask appropriate for the CPU endianness. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f48cfddc |
|
08-Nov-2013 |
Eric W. Biederman <ebiederm@xmission.com> |
vfs: In d_path don't call d_dname on a mount point Aditya Kali (adityakali@google.com) wrote: > Commit bf056bfa80596a5d14b26b17276a56a0dcb080e5: > "proc: Fix the namespace inode permission checks." converted > the namespace files into symlinks. The same commit changed > the way namespace bind mounts appear in /proc/mounts: > $ mount --bind /proc/self/ns/ipc /mnt/ipc > Originally: > $ cat /proc/mounts | grep ipc > proc /mnt/ipc proc rw,nosuid,nodev,noexec 0 0 > > After commit bf056bfa80596a5d14b26b17276a56a0dcb080e5: > $ cat /proc/mounts | grep ipc > proc ipc:[4026531839] proc rw,nosuid,nodev,noexec 0 0 > > This breaks userspace which expects the 2nd field in > /proc/mounts to be a valid path. The symlink /proc/<pid>/ns/{ipc,mnt,net,pid,user,uts} point to dentries allocated with d_alloc_pseudo that we can mount, and that have interesting names printed out with d_dname. When these files are bind mounted /proc/mounts is not currently displaying the mount point correctly because d_dname is called instead of just displaying the path where the file is mounted. Solve this by adding an explicit check to distinguish mounted pseudo inodes and unmounted pseudo inodes. Unmounted pseudo inodes always use mount of their filesstem as the mnt_root in their path making these two cases easy to distinguish. CC: stable@vger.kernel.org Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Reported-by: Aditya Kali <adityakali@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
#
31dec132 |
|
25-Oct-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
fold try_to_ascend() into the sole remaining caller There used to be a bunch of tree-walkers in dcache.c, all alike. try_to_ascend() had been introduced to abstract a piece of logics duplicated in all of them. These days all these tree-walkers are implemented via the same iterator (d_walk()), which is the only remaining caller of try_to_ascend(), so let's fold it back... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
482db906 |
|
25-Oct-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
dcache.c: get rid of pointless macros D_HASH{MASK,BITS} are used once each, both in the same function (d_hash()). At this point they are actively misguiding - they imply that values are compiler constants, which is no longer true. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
2bc74feb |
|
25-Oct-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
take read_seqbegin_or_lock() and friends to seqlock.h Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ede4cebc |
|
13-Nov-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
prepend_path() needs to reinitialize dentry/vfsmount/mnt on restarts ... and equivalent is needed in 3.12; it's broken there as well Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4ec6c2ae |
|
13-Nov-2013 |
Li Zhong <zhong@linux.vnet.ibm.com> |
fix unpaired rcu lock in prepend_path() Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
f80de2cd |
|
18-Jul-2012 |
J. Bruce Fields <bfields@redhat.com> |
dcache: don't clear DCACHE_DISCONNECTED too early DCACHE_DISCONNECTED should not be cleared until we're sure the dentry is connected all the way up to the root of the filesystem. It *shouldn't* be cleared as soon as the dentry is connected to a parent. That will cause bugs at least on exportable filesystems. Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e1a24bb0 |
|
29-Jun-2012 |
J. Bruce Fields <bfields@redhat.com> |
dcache: Don't set DISCONNECTED on "pseudo filesystem" dentries I can't for the life of me see any reason why anyone should care whether a dentry that is never hooked into the dentry cache would need DCACHE_DISCONNECTED set. This originates from 4b936885ab04dc6e0bb0ef35e0e23c1a7364d9e5 "fs: improve scalability of pseudo filesystems", which probably just made the false assumption the DCACHE_DISCONNECTED was meant to be set on anything not connected to a parent somehow. So this is just confusing. Ideally the only uses of DCACHE_DISCONNECTED would be in the filehandle-lookup code, which needs it to ensure dentries are connected into the dentry tree before use. I left d_alloc_pseudo there even though it's now equivalent to __d_alloc(), just on the theory the name is better documentation of its intended use outside dcache.c. Cc: Nick Piggin <npiggin@kernel.dk> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7632e465 |
|
27-Jun-2012 |
J. Bruce Fields <bfields@redhat.com> |
dcache: use IS_ROOT to decide where dentry is hashed Every hashed dentry is either hashed in the dentry_hashtable, or a superblock's s_anon list. __d_drop() assumes it can determine which is the case by checking DCACHE_DISCONNECTED; this is not true. It is true that when DCACHE_DISCONNECTED is cleared, the dentry is not only hashed on dentry_hashtable, but is fully connected to its parents back to the root. But the converse is *not* true: fs/exportfs/expfs.c:reconnect_path() attempts to connect a directory (found by filehandle lookup) back to root by ascending to parents and performing lookups one at a time. It does not clear DCACHE_DISCONNECTED until it's done, and that is not at all an atomic process. In particular, it is possible for DCACHE_DISCONNECTED to be set on a dentry which is hashed on the dentry_hashtable. Instead, use IS_ROOT() to check which hash chain a dentry is on. This *does* work: Dentries are hashed only by: - d_obtain_alias, which adds an IS_ROOT() dentry to sb_anon. - __d_rehash, called by _d_rehash: hashes to the dentry's parent, and all callers of _d_rehash appear to have d_parent set to a "real" parent. - __d_rehash, called by __d_move: rehashes the moved dentry to hash chain determined by target, and assigns target's d_parent to its d_parent, before dropping the dentry's d_lock. Therefore I believe it's safe for a holder of a dentry's d_lock to assume that it is hashed on sb_anon if and only if IS_ROOT(dentry) is true. I believe the incorrect assumption about DCACHE_DISCONNECTED was originally introduced by ceb5bdc2d246 "fs: dcache per-bucket dcache hash locking". Also add a comment while we're here. Cc: Nick Piggin <npiggin@kernel.dk> Acked-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b18825a7 |
|
12-Sep-2013 |
David Howells <dhowells@redhat.com> |
VFS: Put a small type field into struct dentry::d_flags Put a type field into struct dentry::d_flags to indicate if the dentry is one of the following types that relate particularly to pathwalk: Miss (negative dentry) Directory "Automount" directory (defective - no i_op->lookup()) Symlink Other (regular, socket, fifo, device) The type field is set to one of the first five types on a dentry by calls to __d_instantiate() and d_obtain_alias() from information in the inode (if one is given). The type is cleared by dentry_unlink_inode() when it reconstitutes an existing dentry as a negative dentry. Accessors provided are: d_set_type(dentry, type) d_is_directory(dentry) d_is_autodir(dentry) d_is_symlink(dentry) d_is_file(dentry) d_is_negative(dentry) d_is_positive(dentry) A bunch of checks in pathname resolution switched to those. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b61625d2 |
|
04-Oct-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
fold __d_shrink() into its only remaining caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
48a066e7 |
|
29-Sep-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
RCU'd vfsmounts * RCU-delayed freeing of vfsmounts * vfsmount_lock replaced with a seqlock (mount_lock) * sequence number from mount_lock is stored in nameidata->m_seq and used when we exit RCU mode * new vfsmount flag - MNT_SYNC_UMOUNT. Set by umount_tree() when its caller knows that vfsmount will have no surviving references. * synchronize_rcu() done between unlocking namespace_sem in namespace_unlock() and doing pending mntput(). * new helper: legitimize_mnt(mnt, seq). Checks the mount_lock sequence number against seq, then grabs reference to mnt. Then it rechecks mount_lock again to close the race and either returns success or drops the reference it has acquired. The subtle point is that in case of MNT_SYNC_UMOUNT we can simply decrement the refcount and sod off - aforementioned synchronize_rcu() makes sure that final mntput() won't come until we leave RCU mode. We need that, since we don't want to end up with some lazy pathwalk racing with umount() and stealing the final mntput() from it - caller of umount() may expect it to return only once the fs is shut down and we don't want to break that. In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do full-blown mntput() in case of mount_lock sequence number mismatch happening just as we'd grabbed the reference, but in those cases we won't be stealing the final mntput() from anything that would care. * mntput_no_expire() doesn't lock anything on the fast path now. Incidentally, SMP and UP cases are handled the same way - no ifdefs there. * normal pathname resolution does *not* do any writes to mount_lock. It does, of course, bump the refcounts of vfsmount and dentry in the very end, but that's it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
42c32608 |
|
07-Nov-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
switch shrink_dcache_for_umount() to use of d_walk() we have too many iterators in fs/dcache.c... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1ca7d67c |
|
07-Oct-2013 |
John Stultz <john.stultz@linaro.org> |
seqcount: Add lockdep functionality to seqcount/seqlock structures Currently seqlocks and seqcounts don't support lockdep. After running across a seqcount related deadlock in the timekeeping code, I used a less-refined and more focused variant of this patch to narrow down the cause of the issue. This is a first-pass attempt to properly enable lockdep functionality on seqlocks and seqcounts. Since seqcounts are used in the vdso gettimeofday code, I've provided non-lockdep accessors for those needs. I've also handled one case where there were nested seqlock writers and there may be more edge cases. Comments and feedback would be appreciated! Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/1381186321-4906-3-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
358eec18 |
|
31-Oct-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: decrapify dput(), fix cache behavior under normal load We do not want to dirty the dentry->d_flags cacheline in dput() just to set the DCACHE_REFERENCED flag when it is already set in the common case anyway. This way the first cacheline of the dentry (which contains the RCU lookup information etc) can stay shared among multiple CPU's. This finishes off some of the details of all the scalability patches merged during the merge window. Also don't mark dentry_kill() for inlining, since it's the uncommon path and inlining it just makes the common path slower due to extra function entry/exit overhead. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b70a80e7 |
|
01-Oct-2013 |
Miklos Szeredi <mszeredi@suse.cz> |
vfs: introduce d_instantiate_no_diralias() ...which just returns -EBUSY if a directory alias would be created. This is to be used by fuse mkdir to make sure that a buggy or malicious userspace filesystem doesn't do anything nasty. Previously fuse used a private mutex for this purpose, which can now go away. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
|
#
94e92a6e |
|
01-Oct-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
move taking vfsmount_lock down into prepend_path() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
69c88dc7 |
|
19-Oct-2013 |
Randy Dunlap <rdunlap@infradead.org> |
vfs: fix new kernel-doc warnings Move kernel-doc notation to immediately before its function to eliminate kernel-doc warnings introduced by commit db14fc3abcd5 ("vfs: add d_walk()") Warning(fs/dcache.c:1343): No description found for parameter 'data' Warning(fs/dcache.c:1343): No description found for parameter 'dentry' Warning(fs/dcache.c:1343): Excess function parameter 'parent' description in 'check_mount' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Miklos Szeredi <mszeredi@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
05a8252b |
|
15-Sep-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: fix typo in comment in recent dentry work Sedat points out that I transposed some letters in "LRU" and wrote "RLU" instead in one of the new comments explaining the flow. Let's just fix it. Reported-by: Sedat Dilek <sedat.dilek@jpberlin.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
89dc77bc |
|
13-Sep-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: fix dentry LRU list handling and nr_dentry_unused accounting The LRU list changes interacted badly with our nr_dentry_unused accounting, and even worse with the new DCACHE_LRU_LIST bit logic. This introduces helper functions to make sure everything follows the proper dcache d_lru list rules: the dentry cache is complicated by the fact that some of the hotpaths don't even want to look at the LRU list at all, and the fact that we use the same list entry in the dentry for both the LRU list and for our temporary shrinking lists when removing things from the LRU. The helper functions temporarily have some extra sanity checking for the flag bits that have to match the current LRU state of the dentry. We'll remove that before the final 3.12 release, but considering how easy it is to get wrong, this first cleanup version has some very particular sanity checking. Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
68f0d9d9 |
|
12-Sep-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: make d_path() get the root path under RCU This avoids the spinlocks and refcounts in the d_path() sequence too (used by /proc and various other entities). See commit 8b19e34188a3 for the equivalent getcwd() system call path. And unlike getcwd(), d_path() doesn't copy the result to user space, so I don't need to fear _that_ particular bug happening again. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3272c544 |
|
12-Sep-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: use __getname/__putname for getcwd() system call It's a pathname. It should use the pathname allocators and deallocators, and PATH_MAX instead of PAGE_SIZE. Never mind that the two are commonly the same. With this, the allocations scale up nicely too, and I can do getcwd() system calls at a rate of about 300M/s, with no lock contention anywhere. Of course, nobody sane does that, especially since getcwd() is traditionally a very slow operation in Unix. But this was also the simplest way to benchmark the prepend_path() improvements by Waiman, and once I saw the profiles I couldn't leave it well enough alone. But apart from being an performance improvement (from using per-cpu slab allocators instead of the raw page allocator), it's actually a valid and real cleanup. Signed-off-by: Linus "OCD" Torvalds <torvalds@linux-foundation.org>
|
#
ff812d72 |
|
12-Sep-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: don't copy things to user space holding the rcu readlock Oops. That wasn't very smart. We don't actually need the RCU lock any more by the time we copy the cwd string to user space, but I had stupidly surrounded the whole thing with it. Introduced by commit 8b19e34188a3 ("vfs: make getcwd() get the root and pwd path under rcu") Is-a-big-hairy-idiot: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8b19e341 |
|
12-Sep-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: make getcwd() get the root and pwd path under rcu This allows us to skip all the crazy spinlocks and reference count updates, and instead use the fs sequence read-lock to get an atomic snapshot of the root and cwd information. We might want to make the rule that "prepend_path()" is always called with the RCU lock held, but the RCU lock nests fine and this is the minimal fix. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5762482f |
|
12-Sep-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: move get_fs_root_and_pwd() to single caller Let's not pollute the include files with inline functions that are only used in a single place. Especially not if we decide we might want to change the semantics of said function to make it more efficient.. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
18129977 |
|
12-Sep-2013 |
Waiman Long <Waiman.Long@hp.com> |
dcache: get/release read lock in read_seqbegin_or_lock() & friend This patch modifies read_seqbegin_or_lock() and need_seqretry() to use newly introduced read_seqlock_excl() and read_sequnlock_excl() primitives so that they won't change the sequence number even if they fall back to take the lock. This is OK as no change to the protected data structure is being made. It will prevent one fallback to lock taking from cascading into a series of lock taking reducing performance because of the sequence number change. It will also allow other sequence readers to go forward while an exclusive reader lock is taken. This patch also updates some of the inaccurate comments in the code. Signed-off-by: Waiman Long <Waiman.Long@hp.com> To: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9b17c623 |
|
27-Aug-2013 |
Dave Chinner <dchinner@redhat.com> |
fs: convert inode and dentry shrinking to be node aware Now that the shrinker is passing a node in the scan control structure, we can pass this to the the generic LRU list code to isolate reclaim to the lists on matching nodes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4e717f5c |
|
27-Aug-2013 |
Glauber Costa <glommer@gmail.com> |
list_lru: remove special case function list_lru_dispose_all. The list_lru implementation has one function, list_lru_dispose_all, with only one user (the dentry code). At first, such function appears to make sense because we are really not interested in the result of isolating each dentry separately - all of them are going away anyway. However, it's implementation is buggy in the following way: When we call list_lru_dispose_all in fs/dcache.c, we scan all dentries marking them with DCACHE_SHRINK_LIST. However, this is done without the nlru->lock taken. The imediate result of that is that someone else may add or remove the dentry from the LRU at the same time. When list_lru_del happens in that scenario we will see an element that is not yet marked with DCACHE_SHRINK_LIST (even though it will be in the future) and obviously remove it from an lru where the element no longer is. Since list_lru_dispose_all will in effect count down nlru's nr_items and list_lru_del will do the same, this will lead to an imbalance. The solution for this would not be so simple: we can obviously just keep the lru_lock taken, but then we have no guarantees that we will be able to acquire the dentry lock (dentry->d_lock). To properly solve this, we need a communication mechanism between the lru and dentry code, so they can coordinate this with each other. Such mechanism already exists in the form of the list_lru_walk_cb callback. So it is possible to construct a dcache-side prune function that does the right thing only by calling list_lru_walk in a loop until no more dentries are available. With only one user, plus the fact that a sane solution for the problem would involve boucing between dcache and list_lru anyway, I see little justification to keep the special case list_lru_dispose_all in tree. Signed-off-by: Glauber Costa <glommer@openvz.org> Cc: Michal Hocko <mhocko@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
f6041567 |
|
27-Aug-2013 |
Dave Chinner <dchinner@redhat.com> |
dcache: convert to use new lru list infrastructure [glommer@openvz.org: don't reintroduce double decrement of nr_unused_dentries, adapted for new LRU return codes] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Glauber Costa <glommer@openvz.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
0a234c6d |
|
27-Aug-2013 |
Dave Chinner <dchinner@redhat.com> |
shrinker: convert superblock shrinkers to new API Convert superblock shrinker to use the new count/scan API, and propagate the API changes through to the filesystem callouts. The filesystem callouts already use a count/scan API, so it's just changing counters to longs to match the VM API. This requires the dentry and inode shrinker callouts to be converted to the count/scan API. This is mainly a mechanical change. [glommer@openvz.org: use mult_frac for fractional proportions, build fixes] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Glauber Costa <glommer@openvz.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
dd1f6b2e |
|
27-Aug-2013 |
Dave Chinner <dchinner@redhat.com> |
dcache: remove dentries from LRU before putting on dispose list One of the big problems with modifying the way the dcache shrinker and LRU implementation works is that the LRU is abused in several ways. One of these is shrink_dentry_list(). Basically, we can move a dentry off the LRU onto a different list without doing any accounting changes, and then use dentry_lru_prune() to remove it from what-ever list it is now on to do the LRU accounting at that point. This makes it -really hard- to change the LRU implementation. The use of the per-sb LRU lock serialises movement of the dentries between the different lists and the removal of them, and this is the only reason that it works. If we want to break up the dentry LRU lock and lists into, say, per-node lists, we remove the only serialisation that allows this lru list/dispose list abuse to work. To make this work effectively, the dispose list has to be isolated from the LRU list - dentries have to be removed from the LRU *before* being placed on the dispose list. This means that the LRU accounting and isolation is completed before disposal is started, and that means we can change the LRU implementation freely in future. This means that dentries *must* be marked with DCACHE_SHRINK_LIST when they are placed on the dispose list so that we don't think that parent dentries found in try_prune_one_dentry() are on the LRU when the are actually on the dispose list. This would result in accounting the dentry to the LRU a second time. Hence dentry_lru_del() has to handle the DCACHE_SHRINK_LIST case Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Glauber Costa <glommer@openvz.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
19156840 |
|
27-Aug-2013 |
Dave Chinner <dchinner@redhat.com> |
dentry: move to per-sb LRU locks With the dentry LRUs being per-sb structures, there is no real need for a global dentry_lru_lock. The locking can be made more fine-grained by moving to a per-sb LRU lock, isolating the LRU operations of different filesytsems completely from each other. The need for this is independent of any performance consideration that may arise: in the interest of abstracting the lru operations away, it is mandatory that each lru works around its own lock instead of a global lock for all of them. [glommer@openvz.org: updated changelog ] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Glauber Costa <glommer@openvz.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Mel Gorman <mgorman@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
62d36c77 |
|
27-Aug-2013 |
Dave Chinner <dchinner@redhat.com> |
dcache: convert dentry_stat.nr_unused to per-cpu counters Before we split up the dcache_lru_lock, the unused dentry counter needs to be made independent of the global dcache_lru_lock. Convert it to per-cpu counters to do this. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Glauber Costa <glommer@openvz.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Mel Gorman <mgorman@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3942c07c |
|
27-Aug-2013 |
Glauber Costa <glommer@openvz.org> |
fs: bump inode and dentry counters to long This series reworks our current object cache shrinking infrastructure in two main ways: * Noticing that a lot of users copy and paste their own version of LRU lists for objects, we put some effort in providing a generic version. It is modeled after the filesystem users: dentries, inodes, and xfs (for various tasks), but we expect that other users could benefit in the near future with little or no modification. Let us know if you have any issues. * The underlying list_lru being proposed automatically and transparently keeps the elements in per-node lists, and is able to manipulate the node lists individually. Given this infrastructure, we are able to modify the up-to-now hammer called shrink_slab to proceed with node-reclaim instead of always searching memory from all over like it has been doing. Per-node lru lists are also expected to lead to less contention in the lru locks on multi-node scans, since we are now no longer fighting for a global lock. The locks usually disappear from the profilers with this change. Although we have no official benchmarks for this version - be our guest to independently evaluate this - earlier versions of this series were performance tested (details at http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no visible performance regressions while yielding a better qualitative behavior in NUMA machines. With this infrastructure in place, we can use the list_lru entry point to provide memcg isolation and per-memcg targeted reclaim. Historically, those two pieces of work have been posted together. This version presents only the infrastructure work, deferring the memcg work for a later time, so we can focus on getting this part tested. You can see more about the history of such work at http://lwn.net/Articles/552769/ Dave Chinner (18): dcache: convert dentry_stat.nr_unused to per-cpu counters dentry: move to per-sb LRU locks dcache: remove dentries from LRU before putting on dispose list mm: new shrinker API shrinker: convert superblock shrinkers to new API list: add a new LRU list type inode: convert inode lru list to generic lru list code. dcache: convert to use new lru list infrastructure list_lru: per-node list infrastructure shrinker: add node awareness fs: convert inode and dentry shrinking to be node aware xfs: convert buftarg LRU to generic code xfs: rework buffer dispose list tracking xfs: convert dquot cache lru to list_lru fs: convert fs shrinkers to new scan/count API drivers: convert shrinkers to new count/scan API shrinker: convert remaining shrinkers to count/scan API shrinker: Kill old ->shrink API. Glauber Costa (7): fs: bump inode and dentry counters to long super: fix calculation of shrinkable objects for small numbers list_lru: per-node API vmscan: per-node deferred work i915: bail out earlier when shrinker cannot acquire mutex hugepage: convert huge zero page shrinker to new shrinker API list_lru: dynamically adjust node arrays This patch: There are situations in very large machines in which we can have a large quantity of dirty inodes, unused dentries, etc. This is particularly true when umounting a filesystem, where eventually since every live object will eventually be discarded. Dave Chinner reported a problem with this while experimenting with the shrinker revamp patchset. So we believe it is time for a change. This patch just moves int to longs. Machines where it matters should have a big long anyway. Signed-off-by: Glauber Costa <glommer@openvz.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dave Chinner <dchinner@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
48f5ec21 |
|
09-Sep-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
split read_seqretry_or_unlock(), convert d_walk() to resulting primitives Separate "check if we need to retry" from "unlock if we are done and had seq_writelock"; that allows to use these guys in d_walk(), where we need to recheck every time we ascend back to parent, but do *not* want to unlock until the very end. Lift rcu_read_lock/rcu_read_unlock out into callers. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
232d2d60 |
|
08-Sep-2013 |
Waiman Long <Waiman.Long@hp.com> |
dcache: Translating dentry into pathname without taking rename_lock When running the AIM7's short workload, Linus' lockref patch eliminated most of the spinlock contention. However, there were still some left: 8.46% reaim [kernel.kallsyms] [k] _raw_spin_lock |--42.21%-- d_path | proc_pid_readlink | SyS_readlinkat | SyS_readlink | system_call | __GI___readlink | |--40.97%-- sys_getcwd | system_call | __getcwd The big one here is the rename_lock (seqlock) contention in d_path() and the getcwd system call. This patch will eliminate the need to take the rename_lock while translating dentries into the full pathnames. The need to take the rename_lock is to make sure that no rename operation can be ongoing while the translation is in progress. However, only one thread can take the rename_lock thus blocking all the other threads that need it even though the translation process won't make any change to the dentries. This patch will replace the writer's write_seqlock/write_sequnlock sequence of the rename_lock of the callers of the prepend_path() and __dentry_path() functions with the reader's read_seqbegin/read_seqretry sequence within these 2 functions. As a result, the code will have to retry if one or more rename operations had been performed. In addition, RCU read lock will be taken during the translation process to make sure that no dentries will go away. To prevent live-lock from happening, the code will switch back to take the rename_lock if read_seqretry() fails for three times. To further reduce spinlock contention, this patch does not take the dentry's d_lock when copying the filename from the dentries. Instead, it treats the name pointer and length as unreliable and just copy the string byte-by-byte over until it hits a null byte or the end of string as specified by the length. This should avoid stepping into invalid memory address. The error cases are left to be handled by the sequence number check. The following code re-factoring are also made: 1. Move prepend('/') into prepend_name() to remove one conditional check. 2. Move the global root check in prepend_path() back to the top of the while loop. With this patch, the _raw_spin_lock will now account for only 1.2% of the total CPU cycles for the short workload. This patch also has the effect of reducing the effect of running perf on its profile since the perf command itself can be a heavy user of the d_path() function depending on the complexity of the workload. When taking the perf profile of the high-systime workload, the amount of spinlock contention contributed by running perf without this patch was about 16%. With this patch, the spinlock contention caused by the running of perf will go away and we will have a more accurate perf profile. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
0d98439e |
|
08-Sep-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: use lockred "dead" flag to mark unrecoverably dead dentries This simplifies the RCU to refcounting code in particular. I was originally intending to leave this for later, but walking through all the dput() logic (see previous commit), I realized that the dput() "might_sleep()" check was misleadingly weak. And I removed it as misleading, both for performance profiling and for debugging. However, the might_sleep() debugging case is actually true: the final dput() can indeed sleep, if the inode of the dentry that you are releasing ends up sleeping at iput time (see dentry_iput()). So the problem with the might_sleep() in dput() wasn't that it wasn't true, it was that it wasn't actually testing and triggering on the interesting case. In particular, just about *any* dput() can indeed sleep, if you happen to race with another thread deleting the file in question, and you then lose the race to the be the last dput() for that file. But because it's a very rare race, the debugging code would never trigger it in practice. Why is this problematic? The new d_rcu_to_refcount() (see commit 15570086b590: "vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock()") does a dput() for the failure case, and it does it under the RCU lock. So potentially sleeping really is a bug. But there's no way I'm going to fix this with the previous complicated "lockref_get_or_lock()" interface. And rather than revert to the old and crufty nested dentry locking code (which did get this right by delaying the reference count updates until they were verified to be safe), let's make forward progress. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8aab6a27 |
|
08-Sep-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: reorganize dput() memory accesses This is me being a bit OCD after all the dentry optimization work this merge window: profiles end up showing 'dput()' as a rather expensive operation, and there were two unrelated bad reasons for that. The first reason was reading d_lockref.count for debugging purposes, which touches the lockref cacheline (for reads) before really need to. More importantly, the debugging test in question is _wrong_, and has hidden bugs. It's true that we can only sleep when the count goes down to zero, but the test as-is hides the much more subtle bug that happens if we race with somebody else deleting the file. Anyway we _will_ touch that cacheline, but let's do it for a write and in the right routine (ie in "lockref_put_or_lock()") which annotates the costs better. So remove the misleading debug code. The other was an unnecessary access to the cacheline that contains the d_lru list, just to check whether we already were on the LRU list or not. This is exactly what we have d_flags for, so that we can avoid touching extra cache lines for the common case. So just add another bit for "is this dentry on the LRU". Finally, mark the tests properly likely/unlikely, so that the common fast-paths are dense in the instruction stream. This makes the profiles look much saner. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
eed81007 |
|
05-Sep-2013 |
Miklos Szeredi <miklos@szeredi.hu> |
vfs: check unlinked ancestors before mount We check submounts before doing d_drop() on a non-empty directory dentry in NFS (have_submounts()), but we do not exclude a racing mount. Nor do we prevent mounts to be added to the disconnected subtree using relative paths after the d_drop(). This patch fixes these issues by checking for unlinked (unhashed, non-root) ancestors before proceeding with the mount. This is done with rename seqlock taken for write and with ->d_lock grabbed on each ancestor in turn, including our dentry itself. This ensures that the only one of check_submounts_and_drop() or has_unlinked_ancestor() can succeed. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
848ac114 |
|
05-Sep-2013 |
Miklos Szeredi <mszeredi@suse.cz> |
vfs: check submounts and drop atomically We check submounts before doing d_drop() on a non-empty directory dentry in NFS (have_submounts()), but we do not exclude a racing mount. Process A: have_submounts() -> returns false Process B: mount() -> success Process A: d_drop() This patch prepares the ground for the fix by doing the following operations all under the same rename lock: have_submounts() shrink_dcache_parent() d_drop() This is actually an optimization since have_submounts() and shrink_dcache_parent() both traverse the same dentry tree separately. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: David Howells <dhowells@redhat.com> CC: Steven Whitehouse <swhiteho@redhat.com> CC: Trond Myklebust <Trond.Myklebust@netapp.com> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
db14fc3a |
|
05-Sep-2013 |
Miklos Szeredi <mszeredi@suse.cz> |
vfs: add d_walk() This one replaces three instances open coded tree walking (have_submounts, select_parent, d_genocide) with a common helper. In addition to slightly reducing the kernel size, this simplifies the callers and makes them less bug prone. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
01ddc4ed |
|
05-Sep-2013 |
Miklos Szeredi <mszeredi@suse.cz> |
vfs: restructure d_genocide() It shouldn't matter when we decrement the refcount during the walk as long as we do it exactly once. Restructure d_genocide() to do the killing on entering the dentry instead of when leaving it. This helps creating a common helper for tree walking. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
590fb51f |
|
13-Aug-2013 |
Yan, Zheng <zheng.z.yan@intel.com> |
vfs: call d_op->d_prune() before unhashing dentry The d_prune dentry operation is used to notify filesystem when VFS about to prune a hashed dentry from the dcache. There are three code paths that prune dentries: shrink_dcache_for_umount_subtree(), prune_dcache_sb() and d_prune_aliases(). For the d_prune_aliases() case, VFS unhashes the dentry first, then call the d_prune dentry operation. This confuses ceph_d_prune() (ceph uses the d_prune dentry operation to maintain a flag indicating whether the complete contents of a directory are in the dcache, pruning unhashed dentry does not affect dir's completeness) This patch fixes the issue by calling the d_prune dentry operation in d_prune_aliases(), before unhashing the dentry. Also make VFS only call the d_prune dentry operation for hashed dentry, to avoid calling the d_prune dentry operation twice when dentry is pruned by d_prune_aliases(). Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
15570086 |
|
02-Sep-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock() This moves __d_rcu_to_refcount() from <linux/dcache.h> into fs/namei.c and re-implements it using the lockref infrastructure instead. It also adds a lot of comments about what is actually going on, because turning a dentry that was looked up using RCU into a long-lived reference counted entry is one of the more subtle parts of the rcu walk. We also used to be _particularly_ subtle in unlazy_walk() where we re-validate both the dentry and its parent using the same sequence count. We used to do it by nesting the locks and then verifying the sequence count just once. That was silly, because nested locking is expensive, but the sequence count check is not. So this just re-validates the dentry and the parent separately, avoiding the nested locking, and making the lockref lookup possible. Acked-by: Waiman Long <waiman.long@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
df3d0bbc |
|
02-Sep-2013 |
Waiman Long <Waiman.Long@hp.com> |
vfs: use lockref_get_not_zero() for optimistic lockless dget_parent() A valid parent pointer is always going to have a non-zero reference count, but if we look up the parent optimistically without locking, we have to protect against the (very unlikely) race against renaming changing the parent from under us. We do that by using lockref_get_not_zero(), and then re-checking the parent pointer after getting a valid reference. [ This is a re-implementation of a chunk from the original patch by Waiman Long: "dcache: Enable lockless update of dentry's refcount". I've completely rewritten the patch-series and split it up, but I'm attributing this part to Waiman as it's close enough to his earlier patch - Linus ] Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
98474236 |
|
28-Aug-2013 |
Waiman Long <Waiman.Long@hp.com> |
vfs: make the dentry cache use the lockref infrastructure This just replaces the dentry count/lock combination with the lockref structure that contains both a count and a spinlock, and does the mechanical conversion to use the lockref infrastructure. There are no semantic changes here, it's purely syntactic. The reference lockref implementation uses the spinlock exactly the same way that the old dcache code did, and the bulk of this patch is just expanding the internal "d_count" use in the dcache code to use "d_lockref.count" instead. This is purely preparation for the real change to make the reference count updates be lockless during the 3.12 merge window. [ As with the previous commit, this is a rewritten version of a concept originally from Waiman, so credit goes to him, blame for any errors goes to me. Waiman's patch had some semantic differences for taking advantage of the lockless update in dget_parent(), while this patch is intentionally a pure search-and-replace change with no semantic changes. - Linus ] Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
118b2302 |
|
23-Aug-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
cope with potentially long ->d_dname() output for shmem/hugetlb dynamic_dname() is both too much and too little for those - the output may be well in excess of 64 bytes dynamic_dname() assumes to be enough (thanks to ashmem feeding really long names to shmem_file_setup()) and vsnprintf() is an overkill for those guys. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
da53be12 |
|
21-May-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
Don't pass inode to ->d_hash() and ->d_compare() Instances either don't look at it at all (the majority of cases) or only want it to find the superblock (which can be had as dentry->d_sb). A few cases that want more are actually safe with dentry->d_inode - the only precaution needed is the check that it hadn't been replaced with NULL by rmdir() or by overwriting rename(), which case should be simply treated as cache miss. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
0b3fca1f |
|
15-Jun-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
kill find_inode_number() the only remaining caller (in ncpfs) is guaranteed to return 0 - we only hit it if we'd just checked that there's no dentry with such name. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
60545d0d |
|
06-Jun-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
[O_TMPFILE] it's still short a few helpers, but infrastructure should be OK now... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6d4ade98 |
|
14-Jun-2013 |
Steven Whitehouse <swhiteho@redhat.com> |
GFS2: Add atomic_open support I've restricted atomic_open to only operate on regular files, although I still don't understand why atomic_open should not be possible also for directories on GFS2. That can always be added in later though, if it makes sense. The ->atomic_open function can be passed negative dentries, which in most cases means either ENOENT (->lookup) or a call to d_instantiate (->create). In the GFS2 case though, we need to actually perform the look up, since we do not know whether there has been a new inode created on another node. The look up calls d_splice_alias which then tries to rehash the dentry - so the solution here is to simply check for that in d_splice_alias. The same issue is likely to affect any other cluster filesystem implementing ->atomic_open Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "J. Bruce Fields" <bfields fieldses org> Cc: Jeff Layton <jlayton@redhat.com>
|
#
9ed53b12 |
|
11-Mar-2013 |
Wei Yongjun <yongjun_wei@trendmicro.com.cn> |
vfs: use list_move instead of list_del/list_add Using list_move() instead of list_del() + list_add(). Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
61572bb1 |
|
15-Apr-2013 |
Yan, Zheng <zheng.z.yan@intel.com> |
fs: remove dentry_lru_prune() When pruning a dentry, its ancestor dentry can also be pruned. But the ancestor dentry does not go through dput(), so it does not get put on the dentry LRU. Hence associating d_prune with removing the dentry from the LRU is the wrong. The fix is remove dentry_lru_prune(). Call file system's d_prune() callback directly when pruning dentries. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
421348f1 |
|
30-Apr-2013 |
Greg Thelen <gthelen@google.com> |
fs/dcache.c: add cond_resched() to shrink_dcache_parent() Call cond_resched() in shrink_dcache_parent() to maintain interactivity. Before this patch: void shrink_dcache_parent(struct dentry * parent) { while ((found = select_parent(parent, &dispose)) != 0) shrink_dentry_list(&dispose); } select_parent() populates the dispose list with dentries which shrink_dentry_list() then deletes. select_parent() carefully uses need_resched() to avoid doing too much work at once. But neither shrink_dcache_parent() nor its called functions call cond_resched(). So once need_resched() is set select_parent() will return single dentry dispose list which is then deleted by shrink_dentry_list(). This is inefficient when there are a lot of dentry to process. This can cause softlockup and hurts interactivity on non preemptable kernels. This change adds cond_resched() in shrink_dcache_parent(). The benefit of this is that need_resched() is quickly cleared so that future calls to select_parent() are able to efficiently return a big batch of dentry. These additional cond_resched() do not seem to impact performance, at least for the workload below. Here is a program which can cause soft lockup if other system activity sets need_resched(). int main() { struct rlimit rlim; int i; int f[100000]; char buf[20]; struct timeval t1, t2; double diff; /* cleanup past run */ system("rm -rf x"); /* boost nfile rlimit */ rlim.rlim_cur = 200000; rlim.rlim_max = 200000; if (setrlimit(RLIMIT_NOFILE, &rlim)) err(1, "setrlimit"); /* make directory for files */ if (mkdir("x", 0700)) err(1, "mkdir"); if (gettimeofday(&t1, NULL)) err(1, "gettimeofday"); /* populate directory with open files */ for (i = 0; i < 100000; i++) { snprintf(buf, sizeof(buf), "x/%d", i); f[i] = open(buf, O_CREAT); if (f[i] == -1) err(1, "open"); } /* close some of the files */ for (i = 0; i < 85000; i++) close(f[i]); /* unlink all files, even open ones */ system("rm -rf x"); if (gettimeofday(&t2, NULL)) err(1, "gettimeofday"); diff = (((double)t2.tv_sec * 1000000 + t2.tv_usec) - ((double)t1.tv_sec * 1000000 + t1.tv_usec)); printf("done: %g elapsed\n", diff/1e6); return 0; } Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7ea600b5 |
|
26-Mar-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
Nest rename_lock inside vfsmount_lock ... lest we get livelocks between path_is_under() and d_path() and friends. The thing is, wrt fairness lglocks are more similar to rwsems than to rwlocks; it is possible to have thread B spin on attempt to take lock shared while thread A is already holding it shared, if B is on lower-numbered CPU than A and there's a thread C spinning on attempt to take the same lock exclusive. As the result, we need consistent ordering between vfsmount_lock (lglock) and rename_lock (seq_lock), even though everything that takes both is going to take vfsmount_lock only shared. Spotted-by: Brad Spengler <spender@grsecurity.net> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b67bfe0d |
|
27-Feb-2013 |
Sasha Levin <sasha.levin@oracle.com> |
hlist: drop the node parameter from iterators I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ecf3d1f1 |
|
20-Feb-2013 |
Jeff Layton <jlayton@kernel.org> |
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op The following set of operations on a NFS client and server will cause server# mkdir a client# cd a server# mv a a.bak client# sleep 30 # (or whatever the dir attrcache timeout is) client# stat . stat: cannot stat `.': Stale NFS file handle Obviously, we should not be getting an ESTALE error back there since the inode still exists on the server. The problem is that the lookup code will call d_revalidate on the dentry that "." refers to, because NFS has FS_REVAL_DOT set. nfs_lookup_revalidate will see that the parent directory has changed and will try to reverify the dentry by redoing a LOOKUP. That of course fails, so the lookup code returns ESTALE. The problem here is that d_revalidate is really a bad fit for this case. What we really want to know at this point is whether the inode is still good or not, but we don't really care what name it goes by or whether the dcache is still valid. Add a new d_op->d_weak_revalidate operation and have complete_walk call that instead of d_revalidate. The intent there is to allow for a "weaker" d_revalidate that just checks to see whether the inode is still good. This is also gives us an opportunity to kill off the FS_REVAL_DOT special casing. [AV: changed method name, added note in porting, fixed confusion re having it possibly called from RCU mode (it won't be)] Cc: NeilBrown <neilb@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4f522a24 |
|
11-Feb-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
d_hash_and_lookup(): export, switch open-coded instances * calling conventions change - ERR_PTR() is returned on ->d_hash() errors; NULL is just for dcache miss now. * exported, open-coded instances in ncpfs and cifs converted. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
740da42e |
|
30-Jan-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
__d_materialise_unique() is too generic Its first argument is always non-root, while the second one is always root. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
da2d8455 |
|
24-Jan-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
constify d_lookup() arguments Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a713ca2a |
|
24-Jan-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
constify __d_lookup() arguments Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ad8ca374 |
|
14-Jan-2013 |
Jeff Layton <jlayton@kernel.org> |
vfs: remove d_path_with_unreachable The last caller was removed >2 years ago in commit 7b2a69ba7. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b911a6bd |
|
08-Nov-2012 |
NeilBrown <neilb@suse.de> |
vfs: d_obtain_alias() needs to use "/" as default name. NFS appears to use d_obtain_alias() to create the root dentry rather than d_make_root. This can cause 'prepend_path()' to complain that the root has a weird name if an NFS filesystem is lazily unmounted. e.g. if "/mnt" is an NFS mount then { cd /mnt; umount -l /mnt ; ls -l /proc/self/cwd; } will cause a WARN message like WARNING: at /home/git/linux/fs/dcache.c:2624 prepend_path+0x1d7/0x1e0() ... Root dentry has weird name <> to appear in kernel logs. So change d_obtain_alias() to use "/" rather than "" as the anonymous name. Signed-off-by: NeilBrown <neilb@suse.de> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
39e3c955 |
|
28-Nov-2012 |
Jeff Layton <jlayton@kernel.org> |
vfs: remove DCACHE_NEED_LOOKUP The code that relied on that flag was ripped out of btrfs quite some time ago, and never added back. Josef indicated that he was going to take a different approach to the problem in btrfs, and that we could just eliminate this flag. Cc: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8110e16d |
|
17-Sep-2012 |
Miklos Szeredi <miklos@szeredi.hu> |
vfs: dcache: fix deadlock in tree traversal IBM reported a deadlock in select_parent(). This was found to be caused by taking rename_lock when already locked when restarting the tree traversal. There are two cases when the traversal needs to be restarted: 1) concurrent d_move(); this can only happen when not already locked, since taking rename_lock protects against concurrent d_move(). 2) racing with final d_put() on child just at the moment of ascending to parent; rename_lock doesn't protect against this rare race, so it can happen when already locked. Because of case 2, we need to be able to handle restarting the traversal when rename_lock is already held. This patch fixes all three callers of try_to_ascend(). IBM reported that the deadlock is gone with this patch. [ I rewrote the patch to be smaller and just do the "goto again" if the lock was already held, but credit goes to Miklos for the real work. - Linus ] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
fd517909 |
|
18-Sep-2012 |
J. Bruce Fields <bfields@fieldses.org> |
trivial select_parent documentation fix "Search list for X" sounds like you're trying to find X on a list. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1fe0c023 |
|
19-Sep-2012 |
Alan Cox <alan@linux.intel.com> |
vfs: delete surplus inode NULL check Each iteration of d_delete we reload inode from dentry->d_inode and then call S_ISDIR(inode-i_mode), so inode cannot possibly be NULL shortly afterwards unless something went horribly wrong. Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b161dfa6 |
|
17-Sep-2012 |
Miklos Szeredi <mszeredi@suse.cz> |
vfs: dcache: use DCACHE_DENTRY_KILLED instead of DCACHE_DISCONNECTED in d_kill() IBM reported a soft lockup after applying the fix for the rename_lock deadlock. Commit c83ce989cb5f ("VFS: Fix the nfs sillyrename regression in kernel 2.6.38") was found to be the culprit. The nfs sillyrename fix used DCACHE_DISCONNECTED to indicate that the dentry was killed. This flag can be set on non-killed dentries too, which results in infinite retries when trying to traverse the dentry tree. This patch introduces a separate flag: DCACHE_DENTRY_KILLED, which is only set in d_kill() and makes try_to_ascend() test only this flag. IBM reported successful test results with this patch. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ee3efa91 |
|
08-Jun-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
__d_unalias() should refuse to move mountpoints Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b3d9b7a3 |
|
09-Jun-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
vfs: switch i_dentry/d_alias to hlist Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
f7a99c5b |
|
08-Jun-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
get rid of ->mnt_longterm it's enough to set ->mnt_ns of internal vfsmounts to something distinct from all struct mnt_namespace out there; then we can just use the check for ->mnt_ns != NULL in the fast path of mntput_no_expire() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
32ba9c3f |
|
08-Jun-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
Revert "vfs: stop d_splice_alias creating directory aliases" This reverts commit 7732a557b1342c6e6966efb5f07effcf99f56167 (and commit 3f50fff4dace23d3cfeb195d5cd4ee813cee68b7, which was a follow-up cleanup). We're chasing an elusive bug that Dave Jones can apparently reproduce using his system call fuzzer tool, and that looks like some kind of locking ordering problem on the directory i_mutex chain. Our i_mutex locking is rather complex, and depends on the topological ordering of the directories, which is why we have been very wary of splicing directory entries around. Of course, we really don't want to ever see aliased unconnected directories anyway, so none of this should ever happen, but this revert aims to basically get us back to a known older state. Bruce points to some of the previous discussion at http://marc.info/?i=<20110310105821.GE22723@ZenIV.linux.org.uk> and in particular a long post from Neil: http://marc.info/?i=<20110311150749.2fa2be66@notabene.brown> It should be noted that it's possible that Dave's problems come from other changes altohgether, including possibly just the fact that Dave constantly is teachning his fuzzer new tricks. So what appears to be a new bug could in fact be an old one that just gets newly triggered, but reverting these patches as "still under heavy discussion" is the right thing regardless. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3f50fff4 |
|
09-May-2012 |
J. Bruce Fields <bfields@redhat.com> |
vfs: remove unused __d_splice_alias argument Nobody sets want_disconn any more. Reported-by: Peng Tao <bergwolf@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7732a557 |
|
09-May-2012 |
J. Bruce Fields <bfields@redhat.com> |
vfs: stop d_splice_alias creating directory aliases A directory should never have more than one dentry pointing to it. But d_splice_alias() will add one if it finds a directory with an already-existing non-DISCONNECTED dentry. I can't find an obvious reproducer, but I also can't see what prevents d_splice_alias() from encountering such a case. It therefore seems safest to allow d_splice_alias to use any dentry it finds. (Prior to the removal of dentry_unhash() from vfs_rmdir(), around v3.0, this could cause an nfsd deadlock like this: - Somebody attempts to remove a non-empty directory. - The dentry_unhash() in vfs_rmdir() unhashes the dentry pointing to the non-empty directory. - ->rmdir() then fails with -ENOTEMPTY - Before the vfs_rmdir() caller reaches dput(), an nfsd process in rename looks up the directory by filehandle; at the end of that lookup, this dentry is found by d_alloc_anon(), and a reference is taken on it, preventing dput() from removing it. - A regular lookup of the directory calls d_splice_alias(), finds only an unhashed (not a DISCONNECTED) dentry, and insteads adds a new one, so the directory now has two dentries. - The nfsd process in rename, which was previously looking up the source directory of the rename, now looks up the target directory (which is the same), and gets the dentry newly created by the previous lookup. - The rename, seeing two different dentries, assumes this is a cross-directory rename and attempts to take the i_mutex on the directory twice. That reproducer no longer exists, but I don't think there was anything fundamentally incorrect about the vfs_rmdir() behavior there, so I think the real fault was here in d_splice_alias().) Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
962830df |
|
07-May-2012 |
Andi Kleen <ak@linux.intel.com> |
brlocks/lglocks: API cleanups lglocks and brlocks are currently generated with some complicated macros in lglock.h. But there's no reason to not just use common utility functions and put all the data into a common data structure. In preparation, this patch changes the API to look more like normal function calls with pointers, not magic macros. The patch is rather large because I move over all users in one go to keep it bisectable. This impacts the VFS somewhat in terms of lines changed. But no actual behaviour change. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
31fe62b9 |
|
23-May-2012 |
Tim Bird <tim.bird@am.sony.com> |
mm: add a low limit to alloc_large_system_hash UDP stack needs a minimum hash size value for proper operation and also uses alloc_large_system_hash() for proper NUMA distribution of its hash tables and automatic sizing depending on available system memory. On some low memory situations, udp_table_init() must ignore the alloc_large_system_hash() result and reallocs a bigger memory area. As we cannot easily free old hash table, we leak it and kmemleak can issue a warning. This patch adds a low limit parameter to alloc_large_system_hash() to solve this problem. We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table allocation. Reported-by: Mark Asselstine <mark.asselstine@windriver.com> Reported-by: Tim Bird <tim.bird@am.sony.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2e321806 |
|
21-May-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
Revert "vfs: remove unnecessary d_unhashed() check from __d_lookup_rcu" This reverts commit 8c01a529b861ba97c7d78368e6a5d4d42e946f75. It turns out the d_unhashed() check isn't unnecessary after all: while it's true that unhashing will increment the sequence numbers, that does not necessarily invalidate the RCU lookup, because it might have seen the dentry pointer (before it got unhashed), but by the time it loaded the sequence number, it could have seen the *new* sequence number (after it got unhashed). End result: we might look up an unhashed dentry that is about to be freed, with the sequence number never indicating anything bad about it. So checking that the dentry is still hashed (*after* reading the sequence number) is indeed the proper fix, and was never unnecessary. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6326c71f |
|
21-May-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: be even more careful about dentry RCU name lookups Miklos Szeredi points out that we need to also worry about memory odering when doing the dentry name comparison asynchronously with RCU. In particular, doing a rename can do a memcpy() of one dentry name over another, and we want to make sure that any unlocked reader will always see the proper terminating NUL character, so that it won't ever run off the allocation. Rather than having to be extra careful with the name copy or at lookup time for each character, this resolves the issue by making sure that all names that are inlined in the dentry always have a NUL character at the end of the name allocation. If we do that at dentry allocation time, we know that no future name copy will ever change that final NUL to anything else, so there are no memory ordering issues. So even if a concurrent rename ends up overwriting the NUL character that terminates the original name, we always know that there is one final NUL at the end, and there is no worry about the lockless RCU lookup traversing the name too far. The out-of-line allocations are never copied over, so we can just make sure that we write the name (with terminating NULL) and do a write barrier before we expose the name to anything else by setting it in the dentry. Reported-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Nick Piggin <npiggin@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
26fe5750 |
|
10-May-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: make it possible to access the dentry hash/len as one 64-bit entry This allows comparing hash and len in one operation on 64-bit architectures. Right now only __d_lookup_rcu() takes advantage of this, since that is the case we care most about. The use of anonymous struct/unions hides the alternate 64-bit approach from most users, the exception being a few cases where we initialize a 'struct qstr' with a static initializer. This makes the problematic cases use a new QSTR_INIT() helper function for that (but initializing just the name pointer with a "{ .name = xyzzy }" initializer remains valid, as does just copying another qstr structure). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ee983e89 |
|
10-May-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: move dentry name length comparison from dentry_cmp() into callers All callers do want to check the dentry length, but some of them can check the length and the hash together, so doing it in dentry_cmp() can be counter-productive. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
94753db5 |
|
10-May-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: do the careful dentry name access for all dentry_cmp cases Commit 12f8ad4b0533 ("vfs: clean up __d_lookup_rcu() and dentry_cmp() interfaces") did the careful ACCESS_ONCE() of the dentry name only for the word-at-a-time case, even though the issue is generic. Admittedly I don't really see gcc ever reloading the value in the middle of the loop, so the ACCESS_ONCE() protects us from a fairly theoretical issue. But better safe than sorry. Also, this consolidates the common parts of the word-at-a-time and bytewise logic, which includes checking the length. We'll be changing that later. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8c01a529 |
|
10-May-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: remove unnecessary d_unhashed() check from __d_lookup_rcu The check for d_unhashed() is not strictly incorrect, but at the same time it is also not sensible. The actual dentry removal from the dentry hash chains is totally asynchronous to the __d_lookup_rcu() logic, and we depend on __d_drop() updating the sequence number to invalidate any lookup of an unhashed dentry. So checking d_unhashed() is not incorrect, but it's not useful either: the code has to work correctly even without it. So just remove it. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
12f8ad4b |
|
04-May-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: clean up __d_lookup_rcu() and dentry_cmp() interfaces The calling conventions for __d_lookup_rcu() and dentry_cmp() are annoying in different ways, and there is actually one single underlying reason for both of the annoyances. The fundamental reason is that we do the returned dentry sequence number check inside __d_lookup_rcu() instead of doing it in the caller. This results in two annoyances: - __d_lookup_rcu() now not only needs to return the dentry and the sequence number that goes along with the lookup, it also needs to return the inode pointer that was validated by that sequence number check. - and because we did the sequence number check early (to validate the name pointer and length) we also couldn't just pass the dentry itself to dentry_cmp(), we had to pass the counted string that contained the name. So that sequence number decision caused two separate ugly calling conventions. Both of these problems would be solved if we just did the sequence number check in the caller instead. There's only one caller, and that caller already has to do the sequence number check for the parent anyway, so just do that. That allows us to stop returning the dentry->d_inode in that in-out argument (pointer-to-pointer-to-inode), so we can make the inode argument just a regular input inode pointer. The caller can just load the inode from dentry->d_inode, and then do the sequence number check after that to make sure that it's synchronized with the name we looked up. And it allows us to just pass in the dentry to dentry_cmp(), which is what all the callers really wanted. Sure, dentry_cmp() has to be a bit careful about the dentry (which is not stable during RCU lookup), but that's actually very simple. And now that dentry_cmp() can clearly see that the first string argument is a dentry, we can use the direct word access for that, instead of the careful unaligned zero-padding. The dentry name is always properly aligned, since it is a single path component that is either embedded into the dentry itself, or was allocated with kmalloc() (see __d_alloc). Finally, this also uninlines the nasty slow-case for dentry comparisons: that one *does* need to do a sequence number check, since it will call in to the low-level filesystems, and we want to give those a stable inode pointer and path component length/start arguments. Doing an extra sequence check for that slow case is not a problem, though. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e419b4cc |
|
03-May-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: make word-at-a-time accesses handle a non-existing page It turns out that there are more cases than CONFIG_DEBUG_PAGEALLOC that can have holes in the kernel address space: it seems to happen easily with Xen, and it looks like the AMD gart64 code will also punch holes dynamically. Actually hitting that case is still very unlikely, so just do the access, and take an exception and fix it up for the very unlikely case of it being a page-crosser with no next page. And hey, this abstraction might even help other architectures that have other issues with unaligned word accesses than the possible missing next page. IOW, this could do the byte order magic too. Peter Anvin fixed a thinko in the shifting for the exception case. Reported-and-tested-by: Jana Saout <jana@saout.de> Cc: Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b18dafc8 |
|
26-Mar-2012 |
Michel Lespinasse <walken@google.com> |
vfs: fix d_ancestor() case in d_materialize_unique In d_materialise_unique() there are 3 subcases to the 'aliased dentry' case; in two subcases the inode i_lock is properly released but this does not occur in the -ELOOP subcase. This seems to have been introduced by commit 1836750115f2 ("fix loop checks in d_materialise_unique()"). Signed-off-by: Michel Lespinasse <walken@google.com> Cc: stable@vger.kernel.org # v3.0+ [ Added a comment, and moved the unlock to where we generate the -ELOOP, which seems to be more natural. You probably can't actually trigger this without a buggy network file server - d_materialize_unique() is for finding aliases on non-local filesystems, and the d_ancestor() case is for a hardlinked directory loop. But we should be robust in the case of such buggy servers anyway. ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1f1e6e52 |
|
18-Mar-2012 |
Randy Dunlap <rdunlap@infradead.org> |
fs: fix kernel-doc warnings in dcache.c Fix kernel-doc warnings in fs/dcache.c: Warning(fs/dcache.c:1743): No description found for parameter 'seqp' Warning(fs/dcache.c:1743): Excess function parameter 'seq' description in '__d_lookup_rcu' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
32991ab3 |
|
12-Feb-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
vfs: d_alloc_root() gone all callers converted to d_make_root() by now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6d7d1a0d |
|
19-Mar-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: get rid of batshit-insane pointless dentry hash calculations For some odd historical reason, the final mixing round for the dentry cache hash table lookup had an insane "xor with big constant" logic. In two places. The big constant that is being xor'ed is GOLDEN_RATIO_PRIME, which is a fairly random-looking number that is designed to be *multiplied* with so that the bits get spread out over a whole long-word. But xor'ing with it is insane. It doesn't really even change the hash - it really only shifts the hash around in the hash table. To make matters worse, the insane big constant is different on 32-bit and 64-bit builds, even though the name hash bits we use are always 32-bit (and the bits from the pointer we mix in effectively are too). It's all total voodoo programming, in other words. Now, some testing and analysis of the hash chains shows that the rest of the hash function seems to be fairly good. It does pick the right bits of the parent dentry pointer, for example, and while it's generally a bad idea to use an xor to mix down the upper bits (because if there is a repeating pattern, the xor can cause "destructive interference"), it seems to not have been a disaster. For example, replacing the hash with the normal "hash_long()" code (that uses the GOLDEN_RATIO_PRIME constant correctly, btw) actually just makes the hash worse. The hand-picked hash knew which bits of the pointer had the highest entropy, and hash_long() ends up mixing bits less optimally at least in some trivial tests. So the hash function overall seems fine, it just has that really odd "shift result around by a constant xor". So get rid of the silly xor, and replace the down-mixing of the bits with an add instead of an xor that tends to not have the same kind of destructive interference issues. Some stats on the resulting hash chains shows that they look statistically identical before and after, but the code is simpler and no longer makes you go "WTF?". Also, the incoming hash really is just "unsigned int", not a long, and there's no real point to worry about the high 26 bits of the dentry pointer for the 64-bit case, because they are all going to be identical anyway. So also change the hashing to be done in the more natural 'unsigned int' that is the real size of the actual hashed data anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bfcfaa77 |
|
06-Mar-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: use 'unsigned long' accesses for dcache name comparison and hashing Ok, this is hacky, and only works on little-endian machines with goo unaligned handling. And even then only with CONFIG_DEBUG_PAGEALLOC disabled, since it can access up to 7 bytes after the pathname. But it runs like a bat out of hell. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5483f18e |
|
04-Mar-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: move dentry_cmp from <linux/dcache.h> to fs/dcache.c It's only used inside fs/dcache.c, and we're going to play games with it for the word-at-a-time patches. This time we really don't even want to export it, because it really is an internal function to fs/dcache.c, and has been since it was introduced. Having it in that extremely hot header file (it's included in pretty much everything, thanks to <linux/fs.h>) is a disaster for testing different versions, and is utterly pointless. We really should have some kind of header file diet thing, where we figure out which parts of header files are really better off private and only result in more expensive compiles. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8966be90 |
|
02-Mar-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: trivial __d_lookup_rcu() cleanups These don't change any semantics, but they clean up the code a bit and mark some arguments appropriately 'const'. They came up as I was doing the word-at-a-time dcache name accessor code, and cleaning this up now allows me to send out a smaller relevant interesting patch for the experimental stuff. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
630d9c47 |
|
16-Nov-2011 |
Paul Gortmaker <paul.gortmaker@windriver.com> |
fs: reduce the use of module.h wherever possible For files only using THIS_MODULE and/or EXPORT_SYMBOL, map them onto including export.h -- or if the file isn't even using those, then just delete the include. Fix up any implicit include dependencies that were being masked by module.h along the way. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
#
074b8517 |
|
08-Feb-2012 |
Dimitri Sivanich <sivanich@sgi.com> |
vfs: fix panic in __d_lookup() with high dentry hashtable counts When the number of dentry cache hash table entries gets too high (2147483648 entries), as happens by default on a 16TB system, use of a signed integer in the dcache_init() initialization loop prevents the dentry_hashtable from getting initialized, causing a panic in __d_lookup(). Fix this in dcache_init() and similar areas. Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
46f72b34 |
|
10-Jan-2012 |
Sage Weil <sage@newdream.net> |
vfs: export symbol d_find_any_alias() Ceph needs this. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sage Weil <sage@newdream.net>
|
#
eaf5f907 |
|
10-Jan-2012 |
Miklos Szeredi <miklos@szeredi.hu> |
fix shrink_dcache_parent() livelock Two (or more) concurrent calls of shrink_dcache_parent() on the same dentry may cause shrink_dcache_parent() to loop forever. Here's what appears to happen: 1 - CPU0: select_parent(P) finds C and puts it on dispose list, returns 1 2 - CPU1: select_parent(P) locks P->d_lock 3 - CPU0: shrink_dentry_list() locks C->d_lock dentry_kill(C) tries to lock P->d_lock but fails, unlocks C->d_lock 4 - CPU1: select_parent(P) locks C->d_lock, moves C from dispose list being processed on CPU0 to the new dispose list, returns 1 5 - CPU0: shrink_dentry_list() finds dispose list empty, returns 6 - Goto 2 with CPU0 and CPU1 switched Basically select_parent() steals the dentry from shrink_dentry_list() and thinks it found a new one, causing shrink_dentry_list() to think it's making progress and loop over and over. One way to trigger this is to make udev calls stat() on the sysfs file while it is going away. Having a file in /lib/udev/rules.d/ with only this one rule seems to the trick: ATTR{vendor}=="0x8086", ATTR{device}=="0x10ca", ENV{PCI_SLOT_NAME}="%k", ENV{MATCHADDR}="$attr{address}", RUN+="/bin/true" Then execute the following loop: while true; do echo -bond0 > /sys/class/net/bonding_masters echo +bond0 > /sys/class/net/bonding_masters echo -bond1 > /sys/class/net/bonding_masters echo +bond1 > /sys/class/net/bonding_masters done One fix would be to check all callers and prevent concurrent calls to shrink_dcache_parent(). But I think a better solution is to stop the stealing behavior. This patch adds a new dentry flag that is set when the dentry is added to the dispose list. The flag is cleared in dentry_lru_del() in case the dentry gets a new reference just before being pruned. If the dentry has this flag, select_parent() will skip it and let shrink_dentry_list() retry pruning it. With select_parent() skipping those dentries there will not be the appearance of progress (new dentries found) when there is none, hence shrink_dcache_parent() will not loop forever. Set the flag is also set in prune_dcache_sb() for consistency as suggested by Linus. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
adc0e91a |
|
08-Jan-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
vfs: new helper - d_make_root() d_alloc_root() with iput() in case of allocation failure... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b48f03b3 |
|
23-Aug-2011 |
Dave Chinner <david@fromorbit.com> |
dcache: use a dispose list in select_parent select_parent currently abuses the dentry cache LRU to provide cleanup features for child dentries that need to be freed. It moves them to the tail of the LRU, then tells shrink_dcache_parent() to calls __shrink_dcache_sb to unconditionally move them to a dispose list (as DCACHE_REFERENCED is ignored). __shrink_dcache_sb() has to relock the dentries to move them off the LRU onto the dispose list, but otherwise does not touch the dentries that select_parent() moved to the tail of the LRU. It then passses the dispose list to shrink_dentry_list() which tries to free the dentries. IOWs, the use of __shrink_dcache_sb() is superfluous - we can build exactly the same list of dentries for disposal directly in select_parent() and call shrink_dentry_list() instead of calling __shrink_dcache_sb() to do that. This means that we avoid long holds on the lru lock walking the LRU moving dentries to the dispose list We also avoid the need to relock each dentry just to move it off the LRU, reducing the numebr of times we lock each dentry to dispose of them in shrink_dcache_parent() from 3 to 2 times. Further, we remove one of the two callers of __shrink_dcache_sb(). This also means that __shrink_dcache_sb can be moved into back into prune_dcache_sb() and we no longer have to handle referenced dentries conditionally, simplifying the code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
143c8c91 |
|
24-Nov-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
vfs: mnt_ns moved to struct mount Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a73324da |
|
24-Nov-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
vfs: move mnt_mountpoint to struct mount Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
0714a533 |
|
24-Nov-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
vfs: now it can be done - make mnt_parent point to struct mount Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3376f34f |
|
24-Nov-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
vfs: mnt_parent moved to struct mount the second victim... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
676da58d |
|
24-Nov-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
vfs: spread struct mount - mnt_has_parent Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
afac7cba |
|
23-Nov-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
vfs: more mnt_parent cleanups a) mount --move is checking that ->mnt_parent is non-NULL before looking if that parent happens to be shared; ->mnt_parent is never NULL and it's not even an misspelled !mnt_has_parent() b) pivot_root open-codes is_path_reachable(), poorly. c) so does path_is_under(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b2dba1af |
|
23-Nov-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
vfs: new internal helper: mnt_has_parent(mnt) vfsmounts have ->mnt_parent pointing either to a different vfsmount or to itself; it's never NULL and termination condition in loops traversing the tree towards root is mnt == mnt->mnt_parent. At least one place (see the next patch) is confused about what's going on; let's add an explicit helper checking it right way and use it in all places where we need it. Not that there had been too many, but... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
02125a82 |
|
05-Dec-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
fix apparmor dereferencing potentially freed dentry, sanitize __d_path() API __d_path() API is asking for trouble and in case of apparmor d_namespace_path() getting just that. The root cause is that when __d_path() misses the root it had been told to look for, it stores the location of the most remote ancestor in *root. Without grabbing references. Sure, at the moment of call it had been pinned down by what we have in *path. And if we raced with umount -l, we could have very well stopped at vfsmount/dentry that got freed as soon as prepend_path() dropped vfsmount_lock. It is safe to compare these pointers with pre-existing (and known to be still alive) vfsmount and dentry, as long as all we are asking is "is it the same address?". Dereferencing is not safe and apparmor ended up stepping into that. d_namespace_path() really wants to examine the place where we stopped, even if it's not connected to our namespace. As the result, it looked at ->d_sb->s_magic of a dentry that might've been already freed by that point. All other callers had been careful enough to avoid that, but it's really a bad interface - it invites that kind of trouble. The fix is fairly straightforward, even though it's bigger than I'd like: * prepend_path() root argument becomes const. * __d_path() is never called with NULL/NULL root. It was a kludge to start with. Instead, we have an explicit function - d_absolute_root(). Same as __d_path(), except that it doesn't get root passed and stops where it stops. apparmor and tomoyo are using it. * __d_path() returns NULL on path outside of root. The main caller is show_mountinfo() and that's precisely what we pass root for - to skip those outside chroot jail. Those who don't want that can (and do) use d_path(). * __d_path() root argument becomes const. Everyone agrees, I hope. * apparmor does *NOT* try to use __d_path() or any of its variants when it sees that path->mnt is an internal vfsmount. In that case it's definitely not mounted anywhere and dentry_path() is exactly what we want there. Handling of sysctl()-triggered weirdness is moved to that place. * if apparmor is asked to do pathname relative to chroot jail and __d_path() tells it we it's not in that jail, the sucker just calls d_absolute_path() instead. That's the other remaining caller of __d_path(), BTW. * seq_path_root() does _NOT_ return -ENAMETOOLONG (it's stupid anyway - the normal seq_file logics will take care of growing the buffer and redoing the call of ->show() just fine). However, if it gets path not reachable from root, it returns SEQ_SKIP. The only caller adjusted (i.e. stopped ignoring the return value as it used to do). Reviewed-by: John Johansen <john.johansen@canonical.com> ACKed-by: John Johansen <john.johansen@canonical.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org
|
#
dd179946 |
|
16-Aug-2011 |
David Howells <dhowells@redhat.com> |
VFS: Log the fact that we've given ELOOP rather than creating a loop To prevent an NFS server from being used to create a directory loop in an NFS superblock on the client, the following patch was committed: commit 1836750115f20b774e55c032a3893e8c5bdf41ed Author: Al Viro <viro@zeniv.linux.org.uk> Date: Tue Jul 12 21:42:24 2011 -0400 Subject: fix loop checks in d_materialise_unique() This causes ELOOP to be reported to anyone trying to access the dentry that would otherwise cause the kernel to complete the loop. However, no indication is given to the caller as to why an operation that ought to work doesn't. The fault is with the kernel, which doesn't want to try and solve the problem as it gets horrendously messy if there's another mountpoint somewhere in the trees being spliced that can't be moved[*]. [*] The real problem is that we don't handle the excision of a subtree that gets moved _out_ of what we can see. This can happen on the server where a directory is merely moved between two other dirs on the same filesystem, but where destination dir is not accessible by the client. So, given the choice to return ELOOP rather than trying to reconfigure the dentry tree, we should give the caller some indication of why they aren't being allowed to make what should be a legitimate request and log a message. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
50e69630 |
|
07-Nov-2011 |
Al Viro <viro@ZenIV.linux.org.uk> |
vfs: d_invalidate() should leave mountpoints alone Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f0023bc6 |
|
28-Oct-2011 |
Sage Weil <sage@newdream.net> |
vfs: add d_prune dentry operation This adds a d_prune dentry operation that is called by the VFS prior to pruning (i.e. unhashing and killing) a hashed dentry from the dcache. Wrap dentry_lru_del() and use the new _prune() helper in the cases where we are about to unhash and kill the dentry. This will be used by Ceph to maintain a flag indicating whether the complete contents of a directory are contained in the dcache, allowing it to satisfy lookups and readdir without addition server communication. Renumber a few DCACHE_* #defines to group DCACHE_OP_PRUNE with the other DCACHE_OP_ bits. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
830c0f0e |
|
06-Aug-2011 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: renumber DCACHE_xyz flags, remove some stale ones Gcc tends to generate better code with small integers, including the DCACHE_xyz flag tests - so move the common ones to be first in the list. Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and DCACHE_AUTOFS_PENDING values, their users no longer exists in the source tree. And add a "unlikely()" to the DCACHE_OP_COMPARE test, since we want the common case to be a nice straight-line fall-through. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2af14162 |
|
03-Aug-2011 |
Randy Dunlap <rdunlap@infradead.org> |
fs/dcache.c: fix new kernel-doc warning Fix new kernel-doc warning in fs/dcache.c: Warning(fs/dcache.c:797): No description found for parameter 'sb' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
43c1c9cd |
|
07-Jun-2011 |
David Howells <dhowells@redhat.com> |
VFS: Reorganise shrink_dcache_for_umount_subtree() after demise of dcache_lock Reorganise shrink_dcache_for_umount_subtree() in light of the demise of dcache_lock. Without that dcache_lock, there is no need for the batching of removal of dentries from the system under it (we wanted to make intensive use of the locked data whilst we held it, but didn't want to hold it for long at a time). This works, provided the preceding patch is correct in its removal of locking on dentry->d_lock on the basis that no one should be locking these dentries any more as the whole superblock is defunct. With this patch, the calls to dentry_lru_del() and __d_shrink() are placed at the point where each dentry is detached handled. It is possible that, as an alternative, the batching should still be done - but only for dentry_lru_del() of all a dentry's children in one go. In such a case, the batching would be done under dcache_lru_lock. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c6627c60 |
|
07-Jun-2011 |
David Howells <dhowells@redhat.com> |
VFS: Remove dentry->d_lock locking from shrink_dcache_for_umount_subtree() Locks of the dcache_lock were replaced by locks of dentry->d_lock in commits such as: 2304450783dfde7b0b94ae234edd0dbffa865073 2fd6b7f50797f2e993eea59e0a0b8c6399c811dc as part of the RCU-based pathwalk changes, despite the fact that the caller (shrink_dcache_for_umount()) notes in the banner comment the reasons that d_lock is not necessary in these functions: /* * destroy the dentries attached to a superblock on unmounting * - we don't need to use dentry->d_lock because: * - the superblock is detached from all mountings and open files, so the * dentry trees will not be rearranged by the VFS * - s_umount is write-locked, so the memory pressure shrinker will ignore * any dentries belonging to this superblock that it comes across * - the filesystem itself is no longer permitted to rearrange the dentries * in this superblock */ So remove these locks. If the locks are actually necessary, then this banner comment should be altered instead. The hash table chains are protected by 1-bit locks in the hash table heads, so those shouldn't be a problem. Note that to make this work, __d_drop() has to be split so that the RCUwalk barrier can be avoided. This causes problems otherwise as it has an assertion that dentry->d_lock is locked - but there is no need for that as no one else can be trying to access this dentry, except to step over it (and that should be handled by d_free(), I think). Signed-off-by: David Howells <dhowells@redhat.com> Cc: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
35f40ef0 |
|
07-Jun-2011 |
David Howells <dhowells@redhat.com> |
VFS: Remove detached-dentry counter from shrink_dcache_for_umount_subtree() Remove the detached-dentry counter from shrink_dcache_for_umount_subtree() as the value it computes is no longer used as of commit 312d3ca856d369bb04d0443846b85b4cdde6fa8a which made the nr_dentry counters summed per-CPU rather than global atomic. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c46c8877 |
|
26-Jul-2011 |
Jeff Layton <jlayton@kernel.org> |
vfs: document locking requirements for d_move, __d_move and d_materialise_unique Adding a comment to d_materialise_unique per Al's request... d_move and __d_move have some pretty substantial locking requirements, but they are not clearly documented. Add some comments spelling them out. Also, document the requirement for the i_mutex of the parent in d_materialise_unique. Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b91da88f |
|
21-Jul-2011 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: drop conditional inode prefetch in __do_lookup_rcu It seems to hurt performance in real life. Yes, the inode will be used later, but the conditional doesn't seem to predict all that well (negative dentries are not uncommon) and it looks like the cost of prefetching is simply higher than depending on the cache doing the right thing. As usual. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
86c98e8c |
|
18-Jul-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
Remove dead code in dget_parent() ->d_parent is never NULL... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4513d899 |
|
17-Jul-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
switch d_add_ci() to d_splice_alias() in "found negative" case as well Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b0d40c92 |
|
07-Jul-2011 |
Dave Chinner <dchinner@redhat.com> |
superblock: introduce per-sb cache shrinker infrastructure With context based shrinkers, we can implement a per-superblock shrinker that shrinks the caches attached to the superblock. We currently have global shrinkers for the inode and dentry caches that split up into per-superblock operations via a coarse proportioning method that does not batch very well. The global shrinkers also have a dependency - dentries pin inodes - so we have to be very careful about how we register the global shrinkers so that the implicit call order is always correct. With a per-sb shrinker callout, we can encode this dependency directly into the per-sb shrinker, hence avoiding the need for strictly ordering shrinker registrations. We also have no need for any proportioning code for the shrinker subsystem already provides this functionality across all shrinkers. Allowing the shrinker to operate on a single superblock at a time means that we do less superblock list traversals and locking and reclaim should batch more effectively. This should result in less CPU overhead for reclaim and potentially faster reclaim of items from each filesystem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a9049376 |
|
08-Jul-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err) ... and simplify the living hell out of callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a4464dbc |
|
07-Jul-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
Make ->d_sb assign-once and always non-NULL New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name). Allocates dentry, sets its ->d_sb to given superblock and sets ->d_op accordingly. Old d_alloc(NULL, name) callers are converted to that (all of them know what superblock they want). d_alloc() itself is left only for parent != NULl case; uses __d_alloc(), inserts result into the list of parent's children. Note that now ->d_sb is assign-once and never NULL *and* ->d_parent is never NULL either. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
44396f4b |
|
31-May-2011 |
Josef Bacik <josef@redhat.com> |
fs: add a DCACHE_NEED_LOOKUP flag for d_flags Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order for readdir, but unfortunately in the case of anything that stats the files in order that readdir spits back (like oh say ls) that means we still have to do the normal lookup of the file, which means looking up our other index and then looking up the inode. What I want is a way to create dummy dentries when we find them in readdir so that when ls or anything else subsequently does a stat(), we already have the location information in the dentry and can go straight to the inode itself. The lookup stuff just assumes that if it finds a dentry it is done, it doesn't perform a lookup. So add a DCACHE_NEED_LOOKUP flag so that the lookup code knows it still needs to run i_op->lookup() on the parent to get the inode for the dentry. I have tested this with btrfs and I went from something that looks like this http://people.redhat.com/jwhiter/ls-noreada.png To this http://people.redhat.com/jwhiter/ls-good.png Thats a savings of 1300 seconds, or 22 minutes. That is a significant savings. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
18367501 |
|
12-Jul-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
fix loop checks in d_materialise_unique() Both __d_unalias() and __d_materialise_dentry() need loop prevention. Grab rename_lock in caller, check for loops there... As a side benefit, we have dentry_lock_for_move() called only under rename_lock, which seriously reduces deadlock potential of the execrable "locking order" used for ->d_lock. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1495f230 |
|
24-May-2011 |
Ying Han <yinghan@google.com> |
vmscan: change shrinker API by passing shrink_control struct Change each shrinker's API by consolidating the existing parameters into shrink_control struct. This will simplify any further features added w/o touching each file of shrinker. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: fix warning] [kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API] [akpm@linux-foundation.org: fix xfs warning] [akpm@linux-foundation.org: update gfs2] Signed-off-by: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
268bb0ce |
|
20-May-2011 |
Linus Torvalds <torvalds@linux-foundation.org> |
sanitize <linux/prefetch.h> usage Commit e66eed651fd1 ("list: remove prefetching from regular list iterators") removed the include of prefetch.h from list.h, which uncovered several cases that had apparently relied on that rather obscure header file dependency. So this fixes things up a bit, using grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]') grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]') to guide us in finding files that either need <linux/prefetch.h> inclusion, or have it despite not needing it. There are more of them around (mostly network drivers), but this gets many core ones. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1879fd6a |
|
25-Apr-2011 |
Christoph Hellwig <hch@infradead.org> |
add hlist_bl_lock/unlock helpers Now that the whole dcache_hash_bucket crap is gone, go all the way and also remove the weird locking layering violations for locking the hash buckets. Add hlist_bl_lock/unlock helpers to move the locking into the list abstraction instead of requiring each caller to open code it. After all allowing for the bit locks is the whole point of these helpers over the plain hlist variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
dea3667b |
|
24-Apr-2011 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: get rid of insane dentry hashing rules The dentry hashing rules have been really quite complicated for a long while, in odd ways. That made functions like __d_drop() very fragile and non-obvious. In particular, whether a dentry was hashed or not was indicated with an explicit DCACHE_UNHASHED bit. That's despite the fact that the hash abstraction that the dentries use actually have a 'is this entry hashed or not' model (which is a simple test of the 'pprev' pointer). The reason that was done is because we used the normal 'is this entry unhashed' model to mark whether the dentry had _ever_ been hashed in the dentry hash tables, and that logic goes back many years (commit b3423415fbc2: "dcache: avoid RCU for never-hashed dentries"). That, in turn, meant that __d_drop had totally different unhashing logic for the dentry hash table case and for the anonymous dcache case, because in order to use the "is this dentry hashed" logic as a flag for whether it had ever been on the RCU hash table, we had to unhash such a dentry differently so that we'd never think that it wasn't 'unhashed' and wouldn't be free'd correctly. That's just insane. It made the logic really hard to follow, when there were two different kinds of "unhashed" states, and one of them (the one that used "list_bl_unhashed()") really had nothing at all to do with being unhashed per se, but with a very subtle lifetime rule instead. So turn all of it around, and make it logical. Instead of having a DENTRY_UNHASHED bit in d_flags to indicate whether the dentry is on the hash chains or not, use the hash chain unhashed logic for that. Suddenly "d_unhashed()" just uses "list_bl_unhashed()", and everything makes sense. And for the lifetime rule, just use an explicit DENTRY_RCUACCEES bit. If we ever insert the dentry into the dentry hash table so that it is visible to RCU lookup, we mark it DENTRY_RCUACCESS to show that it now needs the RCU lifetime rules. Now suddently that test at dentry free time makes sense too. And because unhashing now is sane and doesn't depend on where the dentry got unhashed from (because the dentry hash chain details doesn't have some subtle side effects), we can re-unify the __d_drop() logic and use common code for the unhashing. Also fix one more open-coded hash chain bit_spin_lock() that I missed in the previous chain locking cleanup commit. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b07ad996 |
|
23-Apr-2011 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: get rid of 'struct dcache_hash_bucket' abstraction It's a useless abstraction for 'hlist_bl_head', and it doesn't actually help anything - quite the reverse. All the users end up having to know about the hlist_bl_head details anyway, using 'struct hlist_bl_node *' etc. So it just makes the code look confusing. And the cost of it is extra '&b->head' syntactic noise, but more importantly it spuriously makes the hash table dentry list look different from the per-superblock DCACHE_DISCONNECTED dentry list. As a result, the code ended up using ad-hoc locking for one case and special helper functions for what is really another totally identical case in the very same function. Make it all look and work the same. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7ebfa57f |
|
15-Apr-2011 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: fix incorrect dentry_update_name_case() BUG_ON() test The case we should be verifying when updating the dentry name is that the _parent_ inode (the directory) semaphore is held, not the semaphore for the dentry itself. It's the directory locking that rename and readdir() etc all care about. The comment just above even says so - but then the BUG_ON() still checked the dentry inode itself. Very few people noticed, because this helper function really isn't used for very much, so you had to be using ncpfs to ever hit it. I think I should just remove the BUG_ON (the function really has just one user), but let's run with it fixed for a while before getting rid of it entirely. Reported-and-tested-by: Bongani Hlope <bonganih@bankservafrica.com> Reported-and-tested-by: Bernd Feige <bernd.feige@uniklinik-freiburg.de> Cc: Petr Vandrovec <petr@vandrovec.name>, Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
24ff6663 |
|
18-Nov-2010 |
Josef Bacik <josef@redhat.com> |
fs: call security_d_instantiate in d_obtain_alias V2 While trying to track down some NFS problems with BTRFS, I kept noticing I was getting -EACCESS for no apparent reason. Eric Paris and printk() helped me figure out that it was SELinux that was giving me grief, with the following denial type=AVC msg=audit(1290013638.413:95): avc: denied { 0x800000 } for pid=1772 comm="nfsd" name="" dev=sda1 ino=256 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file Turns out this is because in d_obtain_alias if we can't find an alias we create one and do all the normal instantiation stuff, but we don't do the security_d_instantiate. Usually we are protected from getting a hashed dentry that hasn't yet run security_d_instantiate() by the parent's i_mutex, but obviously this isn't an option there, so in order to deal with the case that a second thread comes in and finds our new dentry before we get to run security_d_instantiate(), we go ahead and call it if we find a dentry already. Eric assures me that this is ok as the code checks to see if the dentry has been initialized already so calling security_d_instantiate() against the same dentry multiple times is ok. With this patch I'm no longer getting errant -EACCESS values. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c83ce989 |
|
15-Mar-2011 |
Trond Myklebust <Trond.Myklebust@netapp.com> |
VFS: Fix the nfs sillyrename regression in kernel 2.6.38 The new vfs locking scheme introduced in 2.6.38 breaks NFS sillyrename because the latter relies on being able to determine the parent directory of the dentry in the ->iput() callback in order to send the appropriate unlink rpc call. Looking at the code that cares about races with dput(), there doesn't seem to be anything that specifically uses d_parent as a test for whether or not there is a race: - __d_lookup_rcu(), __d_lookup() all test for d_hashed() after d_parent - shrink_dcache_for_umount() is safe since nothing else can rearrange the dentries in that super block. - have_submount(), select_parent() and d_genocide() can test for a deletion if we set the DCACHE_DISCONNECTED flag when the dentry is removed from the parent's d_subdirs list. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org (2.6.38, needs commit c826cb7dfce8 "dcache.c: create helper function for duplicated functionality" ) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c826cb7d |
|
15-Mar-2011 |
Linus Torvalds <torvalds@linux-foundation.org> |
dcache.c: create helper function for duplicated functionality This creates a helper function for he "try to ascend into the parent directory" case, which was written out in triplicate before. With all the locking and subtle sequence number stuff, we really don't want to duplicate that kind of code. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d891eedb |
|
18-Jan-2011 |
J. Bruce Fields <bfields@fieldses.org> |
fs/dcache: allow d_obtain_alias() to return unhashed dentries Without this patch, inodes are not promptly freed on last close of an unlinked file by an nfs client: client$ mount -tnfs4 server:/export/ /mnt/ client$ tail -f /mnt/FOO ... server$ df -i /export server$ rm /export/FOO (^C the tail -f) server$ df -i /export server$ echo 2 >/proc/sys/vm/drop_caches server$ df -i /export the df's will show that the inode is not freed on the filesystem until the last step, when it could have been freed after killing the client's tail -f. On-disk data won't be deallocated either, leading to possible spurious ENOSPC. This occurs because when the client does the close, it arrives in a compound with a putfh and a close, processed like: - putfh: look up the filehandle. The only alias found for the inode will be DCACHE_UNHASHED alias referenced by the filp this, so it creates a new DCACHE_DISCONECTED dentry and returns that instead. - close: closes the existing filp, which is destroyed immediately by dput() since it's DCACHE_UNHASHED. - end of the compound: release the reference to the current filehandle, and dput() the new DCACHE_DISCONECTED dentry, which gets put on the unused list instead of being destroyed immediately. Nick Piggin suggested fixing this by allowing d_obtain_alias to return the unhashed dentry that is referenced by the filp, instead of making it create a new dentry. Leave __d_find_alias() alone to avoid changing behavior of other callers. Also nfsd doesn't need all the checks of __d_find_alias(); any dentry, hashed or unhashed, disconnected or not, should work. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b0a4bb83 |
|
21-Jan-2011 |
Namhyung Kim <namhyung@gmail.com> |
fs: update comments to point correct document dcache-locking.txt is not exist any more, and the path was not correct anyway. Fix it. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
ff5fdb61 |
|
22-Jan-2011 |
Randy Dunlap <randy.dunlap@oracle.com> |
fs: fix new dcache.c kernel-doc warnings Fix new fs/dcache.c kernel-doc warnings: Warning(fs/dcache.c:184): No description found for parameter 'dentry' Warning(fs/dcache.c:296): No description found for parameter 'parent' Warning(fs/dcache.c:1985): No description found for parameter 'dparent' Warning(fs/dcache.c:1985): Excess function parameter 'parent' description in 'd_validate' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9875cf80 |
|
14-Jan-2011 |
David Howells <dhowells@redhat.com> |
Add a dentry op to handle automounting rather than abusing follow_link() Add a dentry op (d_automount) to handle automounting directories rather than abusing the follow_link() inode operation. The operation is keyed off a new dentry flag (DCACHE_NEED_AUTOMOUNT). This also makes it easier to add an AT_ flag to suppress terminal segment automount during pathwalk and removes the need for the kludge code in the pathwalk algorithm to handle directories with follow_link() semantics. The ->d_automount() dentry operation: struct vfsmount *(*d_automount)(struct path *mountpoint); takes a pointer to the directory to be mounted upon, which is expected to provide sufficient data to determine what should be mounted. If successful, it should return the vfsmount struct it creates (which it should also have added to the namespace using do_add_mount() or similar). If there's a collision with another automount attempt, NULL should be returned. If the directory specified by the parameter should be used directly rather than being mounted upon, -EISDIR should be returned. In any other case, an error code should be returned. The ->d_automount() operation is called with no locks held and may sleep. At this point the pathwalk algorithm will be in ref-walk mode. Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is added to handle mountpoints. It will return -EREMOTE if the automount flag was set, but no d_automount() op was supplied, -ELOOP if we've encountered too many symlinks or mountpoints, -EISDIR if the walk point should be used without mounting and 0 if successful. The path will be updated to point to the mounted filesystem if a successful automount took place. __follow_mount() is replaced by follow_managed() which is more generic (especially with the patch that adds ->d_manage()). This handles transits from directories during pathwalk, including automounting and skipping over mountpoints (and holding processes with the next patch). __follow_mount_rcu() will jump out of RCU-walk mode if it encounters an automount point with nothing mounted on it. follow_dotdot*() does not handle automounts as you don't want to trigger them whilst following "..". I've also extracted the mount/don't-mount logic from autofs4 and included it here. It makes the mount go ahead anyway if someone calls open() or creat(), tries to traverse the directory, tries to chdir/chroot/etc. into the directory, or sticks a '/' on the end of the pathname. If they do a stat(), however, they'll only trigger the automount if they didn't also say O_NOFOLLOW. I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their inodes as automount points. This flag is automatically propagated to the dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could save AFS a private flag bit apiece, but is not strictly necessary. It would be preferable to do the propagation in d_set_d_op(), but that doesn't normally have access to the inode. [AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu() succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after that, rather than just returning with ungrabbed *path] Signed-off-by: David Howells <dhowells@redhat.com> Was-Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6f7f7caa |
|
14-Jan-2011 |
Linus Torvalds <torvalds@linux-foundation.org> |
Turn d_set_d_op() BUG_ON() into WARN_ON_ONCE() It's indicative of a real problem, and it actually triggers with autofs4, but the BUG_ON() is excessive. The autofs4 case is being fixed (to only set d_op in the ->lookup method) but not merged yet. In the meantime this gets the code limping along. Reported-by: Alex Elder <aelder@sgi.com> Cc: Ian Kent <raven@themaw.net> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
208898c1 |
|
18-Nov-2010 |
Randy Dunlap <randy.dunlap@oracle.com> |
fs: fix kernel-doc for dcache::prepend_path Fix function kernel-doc warning for prepend_path(): Warning(fs/dcache.c:1924): missing initial short description on line: Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1c977540 |
|
18-Nov-2010 |
Randy Dunlap <randy.dunlap@oracle.com> |
fs: fix kernel-doc for dcache::d_validate Fix function parameter kernel-doc for d_validate(): Warning(fs/dcache.c:1495): No description found for parameter 'parent' Warning(fs/dcache.c:1495): Excess function parameter 'dparent' description in 'd_validate' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c8aebb0c |
|
18-Dec-2010 |
Al Viro <viro@zeniv.linux.org.uk> |
per-superblock default ->d_op Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9d55c369 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: implement faster dentry memcmp The standard memcmp function on a Westmere system shows up hot in profiles in the `git diff` workload (both parallel and single threaded), and it is likely due to the costs associated with trapping into microcode, and little opportunity to improve memory access (dentry name is not likely to take up more than a cacheline). So replace it with an open-coded byte comparison. This increases code size by 8 bytes in the critical __d_lookup_rcu function, but the speedup is huge, averaging 10 runs of each: git diff st user sys elapsed CPU before 1.15 2.57 3.82 97.1 after 1.14 2.35 3.61 96.8 git diff mt user sys elapsed CPU before 1.27 3.85 1.46 349 after 1.26 3.54 1.43 333 Elapsed time for single threaded git diff at 95.0% confidence: -0.21 +/- 0.01 -5.45% +/- 0.24% It's -0.66% +/- 0.06% elapsed time on my Opteron, so rep cmp costs on the fam10h seem to be relatively smaller, but there is still a win. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
e1bb5782 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: prefetch inode data in dcache lookup This makes single threaded git diff -1.25% +/- 0.05% elapsed time on my 2s12c24t Westmere system, and -0.86% +/- 0.05% on my 2s8c Barcelona, by prefetching the important first cacheline of the inode in while we do the actual name compare and other operations on the dentry. There was no measurable slowdown in the single file stat case, or the creat case (where negative dentries would be common). Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
4b936885 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: improve scalability of pseudo filesystems Regardless of how much we possibly try to scale dcache, there is likely always going to be some fundamental contention when adding or removing children under the same parent. Pseudo filesystems do not seem need to have connected dentries because by definition they are disconnected. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
873feea0 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache per-inode inode alias locking dcache_inode_lock can be replaced with per-inode locking. Use existing inode->i_lock for this. This is slightly non-trivial because we sometimes need to find the inode from the dentry, which requires d_inode to be stabilised (either with refcount or d_lock). Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
ceb5bdc2 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache per-bucket dcache hash locking We can turn the dcache hash locking from a global dcache_hash_lock into per-bucket locking. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
44a7d7a8 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: cache optimise dentry and inode for rcu-walk Put dentry and inode fields into top of data structure. This allows RCU path traversal to perform an RCU dentry lookup in a path walk by touching only the first 56 bytes of the dentry. We also fit in 8 bytes of inline name in the first 64 bytes, so for short names, only 64 bytes needs to be touched to perform the lookup. We should get rid of the hash->prev pointer from the first 64 bytes, and fit 16 bytes of name in there, which will take care of 81% rather than 32% of the kernel tree. inode is also rearranged so that RCU lookup will only touch a single cacheline in the inode, plus one in the i_ops structure. This is important for directory component lookups in RCU path walking. In the kernel source, directory names average is around 6 chars, so this works. When we reach the last element of the lookup, we need to lock it and take its refcount which requires another cacheline access. Align dentry and inode operations structs, so members will be at predictable offsets and we can group common operations into head of structure. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
fb045adb |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache reduce branches in lookup path Reduce some branches and memory accesses in dcache lookup by adding dentry flags to indicate common d_ops are set, rather than having to check them. This saves a pointer memory access (dentry->d_op) in common path lookup situations, and saves another pointer load and branch in cases where we have d_op but not the particular operation. Patched with: git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
5f57cbcc |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache remove d_mounted Rather than keep a d_mounted count in the dentry, set a dentry flag instead. The flag can be cleared by checking the hash table to see if there are any mounts left, which is not time critical because it is performed at detach time. The mounted state of a dentry is only used to speculatively take a look in the mount hash table if it is set -- before following the mount, vfsmount lock is taken and mount re-checked without races. This saves 4 bytes on 32-bit, nothing on 64-bit but it does provide a hole I might use later (and some configs have larger than 32-bit spinlocks which might make use of the hole). Autofs4 conversion and changelog by Ian Kent <raven@themaw.net>: In autofs4, when expring direct (or offset) mounts we need to ensure that we block user path walks into the autofs mount, which is covered by another mount. To do this we clear the mounted status so that follows stop before walking into the mount and are essentially blocked until the expire is completed. The automount daemon still finds the correct dentry for the umount due to the follow mount logic in fs/autofs4/root.c:autofs4_follow_link(), which is set as an inode operation for direct and offset mounts only and is called following the lookup that stopped at the covered mount. At the end of the expire the covering mount probably has gone away so the mounted status need not be restored. But we need to check this and only restore the mounted status if the expire failed. XXX: autofs may not work right if we have other mounts go over the top of it? Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
31e6b01f |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: rcu-walk for path lookup Perform common cases of path lookups without any stores or locking in the ancestor dentry elements. This is called rcu-walk, as opposed to the current algorithm which is a refcount based walk, or ref-walk. This results in far fewer atomic operations on every path element, significantly improving path lookup performance. It also avoids cacheline bouncing on common dentries, significantly improving scalability. The overall design is like this: * LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk. * Take the RCU lock for the entire path walk, starting with the acquiring of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are not required for dentry persistence. * synchronize_rcu is called when unregistering a filesystem, so we can access d_ops and i_ops during rcu-walk. * Similarly take the vfsmount lock for the entire path walk. So now mnt refcounts are not required for persistence. Also we are free to perform mount lookups, and to assume dentry mount points and mount roots are stable up and down the path. * Have a per-dentry seqlock to protect the dentry name, parent, and inode, so we can load this tuple atomically, and also check whether any of its members have changed. * Dentry lookups (based on parent, candidate string tuple) recheck the parent sequence after the child is found in case anything changed in the parent during the path walk. * inode is also RCU protected so we can load d_inode and use the inode for limited things. * i_mode, i_uid, i_gid can be tested for exec permissions during path walk. * i_op can be loaded. When we reach the destination dentry, we lock it, recheck lookup sequence, and increment its refcount and mountpoint refcount. RCU and vfsmount locks are dropped. This is termed "dropping rcu-walk". If the dentry refcount does not match, we can not drop rcu-walk gracefully at the current point in the lokup, so instead return -ECHILD (for want of a better errno). This signals the path walking code to re-do the entire lookup with a ref-walk. Aside from the final dentry, there are other situations that may be encounted where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take a reference on the last good dentry) and continue with a ref-walk. Again, if we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup using ref-walk. But it is very important that we can continue with ref-walk for most cases, particularly to avoid the overhead of double lookups, and to gain the scalability advantages on common path elements (like cwd and root). The cases where rcu-walk cannot continue are: * NULL dentry (ie. any uncached path element) * parent with d_inode->i_op->permission or ACLs * dentries with d_revalidate * Following links In future patches, permission checks and d_revalidate become rcu-walk aware. It may be possible eventually to make following links rcu-walk aware. Uncached path elements will always require dropping to ref-walk mode, at the very least because i_mutex needs to be grabbed, and objects allocated. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
77812a1e |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: consolidate dentry kill sequence The tricky locking for disposing of a dentry is duplicated 3 times in the dcache (dput, pruning a dentry from the LRU, and pruning its ancestors). Consolidate them all into a single function dentry_kill. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
ec33679d |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: use RCU in shrink_dentry_list to reduce lock nesting Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
be182bff |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: reduce dcache_inode_lock width in lru scanning Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
89e60548 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache reduce prune_one_dentry locking prune_one_dentry can avoid quite a bit of locking in the common case where ancestors have an elevated refcount. Alternatively, we could have gone the other way and made fewer trylocks in the case where d_count goes to zero, but is probably less common. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
a734eb45 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache reduce d_parent locking Use RCU to simplify locking in dget_parent. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
dc0474be |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache rationalise dget variants dget_locked was a shortcut to avoid the lazy lru manipulation when we already held dcache_lock (lru manipulation was relatively cheap at that point). However, how that the lru lock is an innermost one, we never hold it at any caller, so the lock cost can now be avoided. We already have well working lazy dcache LRU, so it should be fine to defer LRU manipulations to scan time. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
357f8e65 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache reduce dcache_inode_lock dcache_inode_lock can be avoided in d_delete() and d_materialise_unique() in cases where it is not required. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
89ad485f |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache reduce locking in d_alloc Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
61f3dee4 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache reduce dput locking It is possible to run dput without taking data structure locks up-front. In many cases where we don't kill the dentry anyway, these locks are not required. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
58db63d0 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache avoid starvation in dcache multi-step operations Long lived dcache "multi-step" operations which retry on rename seq can be starved with a lot of rename activity. If they fail after the 1st pass, take the rename_lock for writing to avoid further starvation. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
b5c84bf6 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache remove dcache_lock dcache_lock no longer protects anything. remove it. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
949854d0 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: Use rename lock and RCU for multi-step operations The remaining usages for dcache_lock is to allow atomic, multi-step read-side operations over the directory tree by excluding modifications to the tree. Also, to walk in the leaf->root direction in the tree where we don't have a natural d_lock ordering. This could be accomplished by taking every d_lock, but this would mean a huge number of locks and actually gets very tricky. Solve this instead by using the rename seqlock for multi-step read-side operations, retry in case of a rename so we don't walk up the wrong parent. Concurrent dentry insertions are not serialised against. Concurrent deletes are tricky when walking up the directory: our parent might have been deleted when dropping locks so also need to check and retry for that. We can also use the rename lock in cases where livelock is a worry (and it is introduced in subsequent patch). Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
9abca360 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: increase d_name lock coverage Cover d_name with d_lock in more cases, where there may be concurrent modification to it. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
b23fb0a6 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: scale inode alias list Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list from concurrent modification. d_alias is also protected by d_lock. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
2fd6b7f5 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache scale subdirs Protect d_subdirs and d_child with d_lock, except in filesystems that aren't using dcache_lock for these anyway (eg. using i_mutex). Note: if we change the locking rule in future so that ->d_child protection is provided only with ->d_parent->d_lock, it may allow us to reduce some locking. But it would be an exception to an otherwise regular locking scheme, so we'd have to see some good results. Probably not worthwhile. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
da502956 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache scale d_unhashed Protect d_unhashed(dentry) condition with d_lock. This means keeping DCACHE_UNHASHED bit in synch with hash manipulations. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
b7ab39f6 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache scale dentry refcount Make d_count non-atomic and protect it with d_lock. This allows us to ensure a 0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when we start protecting many other dentry members with d_lock. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
23044507 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache scale lru Add a new lock, dcache_lru_lock, to protect the dcache LRU list from concurrent modification. d_lru is also protected by d_lock, which allows LRU lists to be accessed without the lru lock, using RCU in future patches. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
789680d1 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: dcache scale hash Add a new lock, dcache_hash_lock, to protect the dcache hash table from concurrent modification. d_hash is also protected by d_lock. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
ec2447c2 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
hostfs: simplify locking Remove dcache_lock locking from hostfs filesystem, and move it into dcache helpers. All that is required is a coherent path name. Protection from concurrent modification of the namespace after path name generation is not provided in current code, because dcache_lock is dropped before the path is used. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
b1e6a015 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: change d_hash for rcu-walk Change d_hash so it may be called from lock-free RCU lookups. See similar patch for d_compare for details. For in-tree filesystems, this is just a mechanical change. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
621e155a |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: change d_compare for rcu-walk Change d_compare so it may be called from lock-free RCU lookups. This does put significant restrictions on what may be done from the callback, however there don't seem to have been any problems with in-tree fses. If some strange use case pops up that _really_ cannot cope with the rcu-walk rules, we can just add new rcu-unaware callbacks, which would cause name lookup to drop out of rcu-walk mode. For in-tree filesystems, this is just a mechanical change. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
fb2d5b86 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: name case update method smpfs and ncpfs want to update a live dentry name in-place. Rather than have them open code the locking, provide a documented dcache API. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
fe15ce44 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: change d_delete semantics Change d_delete from a dentry deletion notification to a dentry caching advise, more like ->drop_inode. Require it to be constant and idempotent, and not take d_lock. This is how all existing filesystems use the callback anyway. This makes fine grained dentry locking of dput and dentry lru scanning much simpler. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
3e880fb5 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: use fast counters for vfs caches percpu_counter library generates quite nasty code, so unless you need to dynamically allocate counters or take fast approximate value, a simple per cpu set of counters is much better. The percpu_counter can never be made to work as well, because it has an indirection from pointer to percpu memory, and it can't use direct this_cpu_inc interfaces because it doesn't use static PER_CPU data, so code will always be worse. In the fastpath, it is the difference between this: incl %gs:nr_dentry # nr_dentry and this: movl percpu_counter_batch(%rip), %edx # percpu_counter_batch, movl $1, %esi #, movq $nr_dentry, %rdi #, call __percpu_counter_add # (plus I clobber registers) __percpu_counter_add: pushq %rbp # movq %rsp, %rbp #, subq $32, %rsp #, movq %rbx, -24(%rbp) #, movq %r12, -16(%rbp) #, movq %r13, -8(%rbp) #, movq %rdi, %rbx # fbc, fbc #APP # 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1 movq %gs:kernel_stack,%rax #, pfo_ret__ # 0 "" 2 #NO_APP incl -8124(%rax) # <variable>.preempt_count movq 32(%rdi), %r12 # <variable>.counters, tcp_ptr__ #APP # 78 "lib/percpu_counter.c" 1 add %gs:this_cpu_off, %r12 # this_cpu_off, tcp_ptr__ # 0 "" 2 #NO_APP movslq (%r12),%r13 #* tcp_ptr__, tmp73 movslq %edx,%rax # batch, batch addq %rsi, %r13 # amount, count cmpq %rax, %r13 # batch, count jge .L27 #, negl %edx # tmp76 movslq %edx,%rdx # tmp76, tmp77 cmpq %rdx, %r13 # tmp77, count jg .L28 #, .L27: movq %rbx, %rdi # fbc, call _raw_spin_lock # addq %r13, 8(%rbx) # count, <variable>.count movq %rbx, %rdi # fbc, movl $0, (%r12) #,* tcp_ptr__ call _raw_spin_unlock # .L29: #APP # 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1 movq %gs:kernel_stack,%rax #, pfo_ret__ # 0 "" 2 #NO_APP decl -8124(%rax) # <variable>.preempt_count movq -8136(%rax), %rax #, D.14625 testb $8, %al #, D.14625 jne .L32 #, .L31: movq -24(%rbp), %rbx #, movq -16(%rbp), %r12 #, movq -8(%rbp), %r13 #, leave ret .p2align 4,,10 .p2align 3 .L28: movl %r13d, (%r12) # count,* jmp .L29 # .L32: call preempt_schedule # .p2align 4,,6 jmp .L31 # .size __percpu_counter_add, .-__percpu_counter_add .p2align 4,,15 Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
86c8749e |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
vfs: revert per-cpu nr_unused counters for dentry and inodes The nr_unused counters count the number of objects on an LRU, and as such they are synchronized with LRU object insertion and removal and scanning, and protected under the LRU lock. Making it per-cpu does not actually get any concurrency improvements because of this lock, and summing the counter is much slower, and incrementing/decrementing it costs more code size and is slower too. These counters should stay per-LRU, which currently means global. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
786a5e15 |
|
06-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
fs: d_validate fixes d_validate has been broken for a long time. kmem_ptr_validate does not guarantee that a pointer can be dereferenced if it can go away at any time. Even rcu_read_lock doesn't help, because the pointer might be queued in RCU callbacks but not executed yet. So the parent cannot be checked, nor the name hashed. The dentry pointer can not be touched until it can be verified under lock. Hashing simply cannot be used. Instead, verify the parent/child relationship by traversing parent's d_child list. It's slow, but only ncpfs and the destaged smbfs care about it, at this point. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
d3a23e16 |
|
05-Jan-2011 |
Nick Piggin <npiggin@kernel.dk> |
Revert "fs: use RCU read side protection in d_validate" This reverts commit 3825bdb7ed920845961f32f364454bee5f469abb. You cannot dget() a dentry without having a reference, or holding a lock that guarantees it remains valid. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
#
3825bdb7 |
|
10-Oct-2010 |
Christoph Hellwig <hch@infradead.org> |
fs: use RCU read side protection in d_validate d_validate does a purely read lookup in the dentry hash, so use RCU read side locking instead of dcache_lock. Split out from a larget patch by Nick Piggin <npiggin@suse.de>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a4633357 |
|
10-Oct-2010 |
Christoph Hellwig <hch@infradead.org> |
fs: clean up dentry lru modification Always do a list_del_init on the LRU to make sure the list_empty invariant for not beeing on the LRU always holds true, and fold dentry_lru_del_init into dentry_lru_del. Replace the dentry_lru_add_tail primitive with a dentry_lru_move_tail operations that simpler when the dentry already is one the list, which is always is. Move the list_empty into dentry_lru_add to fit the scheme of the other lru helpers, and simplify locking once we move to a separate LRU lock. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3049cfe2 |
|
10-Oct-2010 |
Christoph Hellwig <hch@infradead.org> |
fs: split __shrink_dcache_sb Currently __shrink_dcache_sb has an extremly awkward calling convention because it tries to please very different callers. Split out the main loop into a shrink_dentry_list helper, which gets called directly from shrink_dcache_sb for the cases where all dentries need to be pruned, or from __shrink_dcache_sb for pruning only a certain number of dentries. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
265ac902 |
|
10-Oct-2010 |
Nick Piggin <npiggin@suse.de> |
fs: improve DCACHE_REFERENCED usage dentry referenced bit is only set when installing the dentry back onto the LRU. However with lazy LRU, the dentry can already be on the LRU list at dput time, thus missing out on setting the referenced bit. Fix this. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
312d3ca8 |
|
10-Oct-2010 |
Christoph Hellwig <hch@infradead.org> |
fs: use percpu counter for nr_dentry and nr_dentry_unused The nr_dentry stat is a globally touched cacheline and atomic operation twice over the lifetime of a dentry. It is used for the benfit of userspace only. Turn it into a per-cpu counter and always decrement it in d_free instead of doing various batching operations to reduce lock hold times in the callers. Based on an earlier patch from Nick Piggin <npiggin@suse.de>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9c82ab9c |
|
10-Oct-2010 |
Christoph Hellwig <hch@infradead.org> |
fs: simplify __d_free Remove d_callback and always call __d_free with a RCU head. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
be148247 |
|
10-Oct-2010 |
Christoph Hellwig <hch@infradead.org> |
fs: take dcache_lock inside __d_path All callers take dcache_lock just around the call to __d_path, so take the lock into it in preparation of getting rid of dcache_lock. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
99b7db7b |
|
17-Aug-2010 |
Nick Piggin <npiggin@kernel.dk> |
fs: brlock vfsmount_lock fs: brlock vfsmount_lock Use a brlock for the vfsmount lock. It must be taken for write whenever modifying the mount hash or associated fields, and may be taken for read when performing mount hash lookups. A new lock is added for the mnt-id allocator, so it doesn't need to take the heavy vfsmount write-lock. The number of atomics should remain the same for fastpath rlock cases, though code would be slightly slower due to per-cpu access. Scalability is not not be much improved in common cases yet, due to other locks (ie. dcache_lock) getting in the way. However path lookups crossing mountpoints should be one case where scalability is improved (currently requiring the global lock). The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node Altix system (high latency to remote nodes), a simple umount microbenchmark (mount --bind mnt mnt2 ; umount mnt2 loop 1000 times), before this patch it took 6.8s, afterwards took 7.1s, about 5% slower. Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b04f784e |
|
17-Aug-2010 |
Nick Piggin <npiggin@kernel.dk> |
fs: remove extra lookup in __lookup_hash fs: remove extra lookup in __lookup_hash Optimize lookup for create operations, where no dentry should often be common-case. In cases where it is not, such as unlink, the added overhead is much smaller than the removed. Also, move comments about __d_lookup racyness to the __d_lookup call site. d_lookup is intuitive; __d_lookup is what needs commenting. So in that same vein, add kerneldoc comments to __d_lookup and clean up some of the comments: - We are interested in how the RCU lookup works here, particularly with renames. Make that explicit, and point to the document where it is explained in more detail. - RCU is pretty standard now, and macros make implementations pretty mindless. If we want to know about RCU barrier details, we look in RCU code. - Delete some boring legacy comments because we don't care much about how the code used to work, more about the interesting parts of how it works now. So comments about lazy LRU may be interesting, but would better be done in the LRU or refcount management code. Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
cd956a1c |
|
14-Aug-2010 |
Randy Dunlap <randy.dunlap@oracle.com> |
fs/dcache: fix function param name in kernel-doc Fix parameter name in kernel-doc notation (causes a warning). Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8df9d1a4 |
|
10-Aug-2010 |
Miklos Szeredi <mszeredi@suse.cz> |
vfs: show unreachable paths in getcwd and proc Prepend "(unreachable)" to path strings if the path is not reachable from the current root. Two places updated are - the return string from getcwd() - and symlinks under /proc/$PID. Other uses of d_path() are left unchanged (we know that some old software crashes if /proc/mounts is changed). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ffd1f4ed |
|
10-Aug-2010 |
Miklos Szeredi <mszeredi@suse.cz> |
vfs: only add " (deleted)" where necessary __d_path() has 4 callers: d_path() sys_getcwd() seq_path_root() tomoyo_realpath_from_path2() Of these the only one which needs the " (deleted)" ending is d_path(). sys_getcwd() checks for existence before calling __d_path(). seq_path_root() is used to show the mountpoint path in /proc/PID/mountinfo, which is always a positive. And tomoyo doesn't want the deleted ending. Create a helper "path_with_deleted()" as subsequent patches will need this in multiple places. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
f2eb6575 |
|
10-Aug-2010 |
Miklos Szeredi <mszeredi@suse.cz> |
vfs: add prepend_path() helper Split off prepend_path() from __d_path(). This new helper takes an end-of-buffer pointer and buffer-length pointer just like the other prepend_* functions. Move the " (deleted)" postfix out to __d_path(). This patch doesn't change any functionality but paves the way for the following patches. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
98dc568b |
|
10-Aug-2010 |
Miklos Szeredi <mszeredi@suse.cz> |
vfs: __d_path: dont prepend the name of the root dentry In the old times pseudo-filesystems set the name of theroot dentry to some prefix like "pipe:" and the name of the child dentry to "[123]" and relied on a hack in __d_path() to replace the preceding slash with the root's name to get "pipe:[123]". Then the d_dname() dentry operation was introduced which solved the same problem without having to pre-fill the name in each dentry. Currently the following pseudo filesystems exist in the kernel: perfmon mtd anon_inode bdev pipe socket Of these only perfmon, anon_inode, pipe and socket create sub-dentries, all of which have now been switched to using d_dname(). bdev and mtd only create inodes. This means that now the hack to overwrite the slash can be removed, so for unreachable paths (e.g. within a detached mount) the path string won't be polluted with garbage. For these cases a subsequent patch will add a prefix, indicating that the path is unreachable. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
f7ad3c6b |
|
10-Aug-2010 |
Miklos Szeredi <mszeredi@suse.cz> |
vfs: add helpers to get root and pwd Add three helpers that retrieve a refcounted copy of the root and cwd from the supplied fs_struct. get_fs_root() get_fs_pwd() get_fs_root_and_pwd() Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
dca33252 |
|
24-Jul-2010 |
Al Viro <viro@zeniv.linux.org.uk> |
no need for list_for_each_entry_safe()/resetting with superblock list just delay __put_super() a bit Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c103135c |
|
06-Jun-2010 |
Al Viro <viro@zeniv.linux.org.uk> |
new helper: __dentry_path() builds path relative to fs root, called under dcache_lock, doesn't append any nonsense to unlinked ones. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7f8275d0 |
|
18-Jul-2010 |
Dave Chinner <dchinner@redhat.com> |
mm: add context argument to shrinker callback The current shrinker implementation requires the registered callback to have global state to work from. This makes it difficult to shrink caches that are not global (e.g. per-filesystem caches). Pass the shrinker structure to the callback so that users can embed the shrinker structure in the context the shrinker needs to operate on and get back to it in the callback via container_of(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
57439f87 |
|
23-Jun-2010 |
npiggin@suse.de <npiggin@suse.de> |
fs: fix superblock iteration race list_for_each_entry_safe is not suitable to protect against concurrent modification of the list. 6754af6 introduced a race in sb walking. list_for_each_entry can use the trick of pinning the current entry in the list before we drop and retake the lock because it subsequently follows cur->next. However list_for_each_entry_safe saves n=cur->next for following before entering the loop body, so when the lock is dropped, n may be deleted. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: John Stultz <johnstul@us.ibm.com> Cc: Frank Mayhar <fmayhar@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
79893c17 |
|
22-Mar-2010 |
Al Viro <viro@zeniv.linux.org.uk> |
fix prune_dcache()/umount() race ... and get rid of the last __put_super_and_need_restart() caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
551de6f3 |
|
22-Mar-2010 |
Al Viro <viro@zeniv.linux.org.uk> |
Leave superblocks on s_list until the end We used to remove from s_list and s_instances at the same time. So let's *not* do the former and skip superblocks that have empty s_instances in the loops over s_list. The next step, of course, will be to get rid of rescan logics in those loops. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
13e3c5e5 |
|
21-May-2010 |
Al Viro <viro@zeniv.linux.org.uk> |
clean DCACHE_CANT_MOUNT in d_delete() We set the "it's dead, don't mount on it" flag _and_ do not remove it if we turn the damn thing negative and leave it around. And if it goes positive afterwards, well... Fortunately, there's only one place where that needs to be caught: only d_delete() can turn the sucker negative without immediately freeing it; all other places that can lead to ->d_iput() call are followed by unconditionally freeing struct dentry in question. So the fix is obvious: Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16014 Reported-by: Adam Tkac <vonsch@gmail.com> Tested-by: Adam Tkac <vonsch@gmail.com> Cc: <stable@kernel.org> [2.6.34.x] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4919c5e4 |
|
03-Mar-2010 |
Al Viro <viro@zeniv.linux.org.uk> |
fix race in d_splice_alias() rehashing the negative placeholder opens a race with d_lookup(); we unhash it almost immediately (by d_move()), but the race window is there. Since d_move() doesn't rely on target being hashed, we don't need that d_rehash() at all. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
2096f759 |
|
30-Jan-2010 |
Al Viro <viro@zeniv.linux.org.uk> |
New helper: path_is_under(path1, path2) Analog of is_subdir for vfsmount,dentry pairs, moved from audit_tree.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ec4f8605 |
|
05-Jan-2010 |
H Hartley Sweeten <hartleys@visionengravers.com> |
fs/dcache.c: CodingStyle cleanup Cleanup EXPORT* macros according to Documantation/CodingStyle. Move EXPORT* macros to the line immediately after the closing function brace. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ef26ca97 |
|
29-Sep-2009 |
H Hartley Sweeten <hartleys@visionengravers.com> |
libfs: move EXPORT_SYMBOL for d_alloc_name The EXPORT_SYMBOL for d_alloc_name is in fs/libfs.c but the function is in fs/dcache.c. Move the EXPORT_SYMBOL to the line immediately after the closing function brace line in fs/dcache.c as mentioned in Documentation/CodingStyle. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
613afbf8 |
|
16-Jul-2009 |
Frederic Weisbecker <fweisbec@gmail.com> |
sched: Pull up the might_sleep() check into cond_resched() might_sleep() is called late-ish in cond_resched(), after the need_resched()/preempt enabled/system running tests are checked. It's better to check the sleeps while atomic earlier and not depend on some environment datas that reduce the chances to detect a problem. Also define cond_resched_*() helpers as macros, so that the FILE/LINE reported in the sleeping while atomic warning displays the real origin and not sched.h Changes in v2: - Call __might_sleep() directly instead of might_sleep() which may call cond_resched() - Turn cond_resched() into a macro so that the file:line couple reported refers to the caller of cond_resched() and not __cond_resched() itself. Changes in v3: - Also propagate this __might_sleep() pull up to cond_resched_lock() and cond_resched_softirq() Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1247725694-6082-6-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
#
f3da392e |
|
03-May-2009 |
Alexey Dobriyan <adobriyan@gmail.com> |
dcache: extrace and use d_unlinked() d_unlinked() will be used in middle-term to ban checkpointing when opened but unlinked file is detected, and in long term, to detect such situation and special case on it. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c490d79b |
|
26-Apr-2009 |
npiggin@suse.de <npiggin@suse.de> |
fs: dcache fix LRU ordering Fix ordering of LRU when moving referenced dentries to the head of the list (they should go to the head of the list in the same order as they were found from the tail, rather than reverse order). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
24b6f16e |
|
18-Apr-2009 |
Al Viro <viro@zeniv.linux.org.uk> |
No need for crossing to mountpoint in audit_tag_tree() is_under() will DTRT anyway. And yes, is_subdir() behaviour is intentional. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e5824c97 |
|
29-Mar-2009 |
Al Viro <viro@zeniv.linux.org.uk> |
Trim includes of fdtable.h Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
5ad4e53b |
|
29-Mar-2009 |
Al Viro <viro@zeniv.linux.org.uk> |
Get rid of indirect include of fs_struct.h Don't pull it in sched.h; very few files actually need it and those can include directly. sched.h itself only needs forward declaration of struct fs_struct; Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b6520c81 |
|
05-Jan-2009 |
Christoph Hellwig <hch@lst.de> |
cleanup d_add_ci Make sure that comments describe what's going on and not how, and always use __d_instantiate instead of two separate branches, one with d_instantiate and one with __d_instantiate. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
adc48720 |
|
27-Feb-2009 |
Benny Halevy <bhalevy@panasas.com> |
EXPORT_SYMBOL(d_obtain_alias) rather than EXPORT_SYMBOL_GPL Commit 4ea3ada2955e4519befa98ff55dd62d6dfbd1705 declares d_obtain_alias() as EXPORT_SYMBOL_GPL where it's supposed to replace d_alloc_anon which was previously declared as EXPORT_SYMBOL and thus available to any loadable module. This patch reverts that. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3cdad428 |
|
14-Jan-2009 |
Heiko Carstens <hca@linux.ibm.com> |
[CVE-2009-0029] System call wrappers part 20 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
#
9a8d5bb4 |
|
07-Jan-2009 |
Wu Fengguang <fengguang.wu@intel.com> |
generic swap(): dcache: use swap() instead of private do_switch() Use the new generic implementation. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b6b3fdea |
|
10-Dec-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
filp_cachep can be static in fs/file_table.c Instead of creating the "filp" kmem_cache in vfs_caches_init(), we can do it a litle be later in files_init(), so that filp_cachep is static to fs/file_table.c Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
52afeefb |
|
01-Dec-2008 |
Arjan van de Ven <arjan@infradead.org> |
expand some comments (d_path / seq_path) Explain that you really need to use the return value of d_path rather than the buffer you passed into it. Also fix the comment for seq_path(), the function arguments changed recently but the comment hadn't been updated in sync. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
be42c4c4 |
|
01-Dec-2008 |
Zhaolei <zhaolei@cn.fujitsu.com> |
correct wrong function name of d_put in kernel document and source comment no function named d_put(), it should be dput(). Impact: fix document and comment, no functionality changed Signed-off-by: Zhao Lei <zhaolei@cn.fuijtsu.com> Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
dc711ca3 |
|
03-Nov-2008 |
Al Viro <viro@zeniv.linux.org.uk> |
fix switch_names() breakage in short-to-short case We want ->name.len to match the resulting name on *both* source and target Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c2452f32 |
|
01-Dec-2008 |
Nick Piggin <npiggin@suse.de> |
shrink struct dentry struct dentry is one of the most critical structures in the kernel. So it's sad to see it going neglected. With CONFIG_PROFILING turned on (which is probably the common case at least for distros and kernel developers), sizeof(struct dcache) == 208 here (64-bit). This gives 19 objects per slab. I packed d_mounted into a hole, and took another 4 bytes off the inline name length to take the padding out from the end of the structure. This shinks it to 200 bytes. I could have gone the other way and increased the length to 40, but I'm aiming for a magic number, read on... I then got rid of the d_cookie pointer. This shrinks it to 192 bytes. Rant: why was this ever a good idea? The cookie system should increase its hash size or use a tree or something if lookups are a problem. Also the "fast dcookie lookups" in oprofile should be moved into the dcookie code -- how can oprofile possibly care about the dcookie_mutex? It gets dropped after get_dcookie() returns so it can't be providing any sort of protection. At 192 bytes, 21 objects fit into a 4K page, saving about 3MB on my system with ~140 000 entries allocated. 192 is also a multiple of 64, so we get nice cacheline alignment on 64 and 32 byte line systems -- any given dentry will now require 3 cachelines to touch all fields wheras previously it would require 4. I know the inline name size was chosen quite carefully, however with the reduction in cacheline footprint, it should actually be just about as fast to do a name lookup for a 36 character name as it was before the patch (and faster for other sizes). The memory footprint savings for names which are <= 32 or > 36 bytes long should more than make up for the memory cost for 33-36 byte names. Performance is a feature... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
fd217f4d |
|
21-Oct-2008 |
Arjan van de Ven <arjan@infradead.org> |
[PATCH] fs: add a sanity check in d_free Hi Al, remember that debug session we did at KS? You suggested this patch back then.... From 7751eaf30474b8cbfaea64795805a17eab05ac53 Mon Sep 17 00:00:00 2001 From: Arjan van de Ven <arjan@linux.intel.com> Date: Tue, 16 Sep 2008 16:51:17 -0700 Subject: [PATCH] fs: add a sanity check in d_free we're seeing some corruption in the dentry->d_alias list that appears like a free of an entry still on the list; this patch adds a WARN_ON() to catch this scenario, as suggested by Al Viro Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
|
#
5cec56de |
|
13-Oct-2008 |
Qinghuang Feng <qhfeng.kernel@gmail.com> |
[PATCH] fs/dcache.c: update comment of d_validate() Parameters @hash and @len have been removed since 2.4.3, now just to delete them. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
|
#
8f3dfaa5 |
|
15-Oct-2008 |
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> |
[PATCH vfs-2.6 4/6] vfs: remove unnecessary fsnotify_d_instantiate() This calls d_move(), so fsnotify_d_instantiate() is unnecessary like rename path. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
|
#
360da900 |
|
15-Oct-2008 |
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> |
[PATCH vfs-2.6 3/6] vfs: add __d_instantiate() helper This adds __d_instantiate() for users which is already taking dcache_lock, and replace with it. The part of d_add_ci() isn't equivalent. But it should be needed fsnotify_d_instantiate() actually, because the path is to add the inode to negative dentry. fsnotify_d_instantiate() should be called after change from negative to positive. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
|
#
e2761a11 |
|
15-Oct-2008 |
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> |
[PATCH vfs-2.6 2/6] vfs: add d_ancestor() This adds d_ancestor() instead of d_isparent(), then use it. If new_dentry == old_dentry, is_subdir() returns 1, looks strange. "new_dentry == old_dentry" is not subdir obviously. But I'm not checking callers for now, so this keeps current behavior. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
|
#
871c0067 |
|
15-Oct-2008 |
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> |
[PATCH vfs-2.6 1/6] vfs: replace parent == dentry->d_parent by IS_ROOT() Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
|
#
9308a612 |
|
11-Aug-2008 |
Christoph Hellwig <hch@lst.de> |
[PATCH] kill d_alloc_anon Remove d_alloc_anon now that no users are left. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
44003728 |
|
11-Aug-2008 |
Christoph Hellwig <hch@lst.de> |
[PATCH] switch all filesystems over to d_obtain_alias Switch all users of d_alloc_anon to d_obtain_alias. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4ea3ada2 |
|
11-Aug-2008 |
Christoph Hellwig <hch@lst.de> |
[PATCH] new helper: d_obtain_alias The calling conventions of d_alloc_anon are rather unfortunate for all users, and it's name is not very descriptive either. Add d_obtain_alias as a new exported helper that drops the inode reference in the failure case, too and allows to pass-through NULL pointers and inodes to allow for tail-calls in the export operations. Incidentally this helper already existed as a private function in libfs.c as exportfs_d_alloc so kill that one and switch the callers to d_obtain_alias. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
d0185c08 |
|
29-Sep-2008 |
Linus Torvalds <torvalds@linux-foundation.org> |
Fix NULL pointer dereference in proc_sys_compare The VFS interface for the 'd_compare()' is a bit special (read: 'odd'), because it really just essentially replaces a memcmp(). The filesystem is supposed to just compare the two names with whatever case-independent or other function. And when I say 'is supposed to', I obviously mean that 'procfs does odd things, and actually looks at the dentry that we don't even pass down, rather than just the name'. Which results in problems, because we actually call d_compare before we have even verified that the dentry is still hashed at all. And that causes a problm since the inode that procfs looks at may have been free'd and the d_inode pointer is NULL. procfs just assumes that all dentries are positive, since procfs itself never generates a negative one. But memory pressure will still result in the dentry getting torn down, and as it is removed by RCU, it still remains visible on some lists - and to d_compare. If the filesystem just did a name comparison, we wouldn't care. And we could just fix procfs to know about negative dentries too. But rather than have the low-level filesystems know about internal VFS details, just move the check for a unhashed dentry up a bit, so that we will only call d_compare on dentries that are still active. The actual oops this caused didn't look like a NULL pointer dereference because procfs did a 'container_of(inode, struct proc_inode, vfs_inode)' to get at its internal proc_inode information from the inode pointer, and accessed a field below the inode. So the oops would look something like BUG: unable to handle kernel paging request at fffffffffffffff0 IP: [<ffffffff802bc6c6>] proc_sys_compare+0x36/0x50 and was seen on both x86-64 (Alexey Dobriyan and Hugh Dickins) and ppc64 (Hugh Dickins). Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Hugh Dickins <hugh@veritas.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-of-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e45b590b |
|
07-Aug-2008 |
Christoph Hellwig <hch@lst.de> |
[PATCH] change d_add_ci argument ordering As pointed out during review d_add_ci argument order should match d_add, so switch the dentry and inode arguments. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9403540c |
|
21-May-2008 |
Barry Naujok <bnaujok@sgi.com> |
dcache: Add case-insensitive support d_ci_add() routine This add a dcache entry to the dcache for lookup, but changing the name that is associated with the entry rather than the one passed in to the lookup routine. First, it sees if the case-exact match already exists in the dcache and uses it if one exists. Otherwise, it allocates a new node with the new name and splices it into the dcache. Original code from ntfs_lookup in fs/ntfs/namei.c by Anton Altaparmakov. Signed-off-by: Barry Naujok <bnaujok@sgi.com> Signed-off-by: Anton Altaparmakov <aia21@cantab.net> Acked-by: Christoph Hellwig <hch@infradead.org>
|
#
f3c6ba98 |
|
25-Jul-2008 |
Kentaro Makita <k-makita@np.css.fujitsu.com> |
vfs: add cond_resched_lock while scanning dentry LRU lists Add cond_resched_lock(&dcache_lock) while scanning LRU lists on superblocks in __shrink_dcache_sb() Signed-off-by: Kentaro Makita <k-makita@np.css.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
da3bbdd4 |
|
23-Jul-2008 |
Kentaro Makita <k-makita@np.css.fujitsu.com> |
fix soft lock up at NFS mount via per-SB LRU-list of unused dentries [Summary] Split LRU-list of unused dentries to one per superblock to avoid soft lock up during NFS mounts and remounting of any filesystem. Previously I posted here: http://lkml.org/lkml/2008/3/5/590 [Descriptions] - background dentry_unused is a list of dentries which are not referenced. dentry_unused grows up when references on directories or files are released. This list can be very long if there is huge free memory. - the problem When shrink_dcache_sb() is called, it scans all dentry_unused linearly under spin_lock(), and if dentry->d_sb is differnt from given superblock, scan next dentry. This scan costs very much if there are many entries, and very ineffective if there are many superblocks. IOW, When we need to shrink unused dentries on one dentry, but scans unused dentries on all superblocks in the system. For example, we scan 500 dentries to unmount a filesystem, but scans 1,000,000 or more unused dentries on other superblocks. In our case , At mounting NFS*, shrink_dcache_sb() is called to shrink unused dentries on NFS, but scans 100,000,000 unused dentries on superblocks in the system such as local ext3 filesystems. I hear NFS mounting took 1 min on some system in use. * : NFS uses virtual filesystem in rpc layer, so NFS is affected by this problem. 100,000,000 is possible number on large systems. Per-superblock LRU of unused dentried can reduce the cost in reasonable manner. - How to fix I found this problem is solved by David Chinner's "Per-superblock unused dentry LRU lists V3"(1), so I rebase it and add some fix to reclaim with fairness, which is in Andrew Morton's comments(2). 1) http://lkml.org/lkml/2006/5/25/318 2) http://lkml.org/lkml/2006/5/25/320 Split LRU-list of unused dentries to each superblocks. Then, NFS mounting will check dentries under a superblock instead of all. But this spliting will break LRU of dentry-unused. So, I've attempted to make reclaim unused dentrins with fairness by calculate number of dentries to scan on this sb based on following way number of dentries to scan on this sb = count * (number of dentries on this sb / number of dentries in the machine) - ToDo - I have to measuring performance number and do stress tests. - When unmount occurs during prune_dcache(), scanning on same superblock, It is unable to reach next superblock because it is gone away. We restart scannig superblock from first one, it causes unfairness of reclaim unused dentries on first superblock. But I think this happens very rarely. - Test Results Result on 6GB boxes with excessive unused dentries. Without patch: $ cat /proc/sys/fs/dentry-state 10181835 10180203 45 0 0 0 # mount -t nfs 10.124.60.70:/work/kernel-src nfs real 0m1.830s user 0m0.001s sys 0m1.653s With this patch: $ cat /proc/sys/fs/dentry-state 10236610 10234751 45 0 0 0 # mount -t nfs 10.124.60.70:/work/kernel-src nfs real 0m0.106s user 0m0.002s sys 0m0.032s [akpm@linux-foundation.org: fix comments] Signed-off-by: Kentaro Makita <k-makita@np.css.fujitsu.com> Cc: Neil Brown <neilb@suse.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: David Chinner <dgc@sgi.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
cdd16d02 |
|
23-Jun-2008 |
Miklos Szeredi <mszeredi@suse.cz> |
[patch 2/3] vfs: dcache cleanups Comment from Al Viro: add prepend_name() wrapper. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
31f3e0b3 |
|
23-Jun-2008 |
Miklos Szeredi <mszeredi@suse.cz> |
[patch 1/3] vfs: dcache sparse fixes Fix the following sparse warnings: fs/dcache.c:2183:19: warning: symbol 'filp_cachep' was not declared. Should it be static? fs/dcache.c:115:3: warning: context imbalance in 'dentry_iput' - unexpected unlock fs/dcache.c:188:2: warning: context imbalance in 'dput' - different lock contexts for basic block fs/dcache.c:400:2: warning: context imbalance in 'prune_one_dentry' - different lock contexts for basic block fs/dcache.c:431:22: warning: context imbalance in 'prune_dcache' - different lock contexts for basic block fs/dcache.c:563:2: warning: context imbalance in 'shrink_dcache_sb' - different lock contexts for basic block fs/dcache.c:1385:6: warning: context imbalance in 'd_delete' - wrong count at exit fs/dcache.c:1636:2: warning: context imbalance in '__d_unalias' - unexpected unlock fs/dcache.c:1735:2: warning: context imbalance in 'd_materialise_unique' - different lock contexts for basic block Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reviewed-by: Matthew Wilcox <willy@linux.intel.com> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
be285c71 |
|
16-Jun-2008 |
Andreas Gruenbacher <agruen@suse.de> |
[patch 3/3] vfs: make d_path() consistent across mount operations The path that __d_path() computes can become slightly inconsistent when it races with mount operations: it grabs the vfsmount_lock when traversing mount points but immediately drops it again, only to re-grab it when it reaches the next mount point. The result is that the filename computed is not always consisent, and the file may never have had that name. (This is unlikely, but still possible.) Fix this by grabbing the vfsmount_lock for the whole duration of __d_path(). Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: John Johansen <jjohansen@suse.de> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
20d4fdc1 |
|
09-Jun-2008 |
Jan Engelhardt <jengelh@medozas.de> |
[patch 2/4] fs: make struct file arg to d_path const Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9d1bc601 |
|
27-Mar-2008 |
Miklos Szeredi <mszeredi@suse.cz> |
[patch 2/7] vfs: mountinfo: add seq_file_root() Add a new function: seq_file_root() This is similar to seq_path(), but calculates the path relative to the given root, instead of current->fs->root. If the path was unreachable from root, then modify the root parameter to reflect this. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6092d048 |
|
27-Mar-2008 |
Ram Pai <linuxram@us.ibm.com> |
[patch 1/7] vfs: mountinfo: add dentry_path() [mszeredi@suse.cz] split big patch into managable chunks Add the following functions: dentry_path() seq_dentry() These are similar to d_path() and seq_path(). But instead of calculating the path within a mount namespace, they calculate the path from the root of the filesystem to a given dentry, ignoring mounts completely. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4a0962ab |
|
14-Feb-2008 |
Christoph Lameter <clameter@sgi.com> |
dentries: Extract common code to remove dentry from lru Extract the common code to remove a dentry from the lru into a new function dentry_lru_remove(). Two call sites used list_del() instead of list_del_init(). AFAIK the performance of both is the same. dentry_lru_remove() does a list_del_init(). As a result dentry->d_lru is now always empty when a dentry is freed. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
cf28b486 |
|
14-Feb-2008 |
Jan Blunck <jblunck@suse.de> |
d_path: Make d_path() use a struct path d_path() is used on a <dentry,vfsmount> pair. Lets use a struct path to reflect this. [akpm@linux-foundation.org: fix build in mm/memory.c] Signed-off-by: Jan Blunck <jblunck@suse.de> Acked-by: Bryan Wu <bryan.wu@analog.com> Acked-by: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Neil Brown <neilb@suse.de> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a03a8a70 |
|
14-Feb-2008 |
Jan Blunck <jblunck@suse.de> |
d_path: kerneldoc cleanup Move and update d_path() kernel API documentation. Signed-off-by: Jan Blunck <jblunck@suse.de> Acked-by: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
329c97f0 |
|
14-Feb-2008 |
Jan Blunck <jblunck@suse.de> |
One less parameter to __d_path All callers to __d_path pass the dentry and vfsmount of a struct path to __d_path. Pass the struct path directly, instead. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Jan Blunck <jblunck@suse.de> Acked-by: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6ac08c39 |
|
14-Feb-2008 |
Jan Blunck <jblunck@suse.de> |
Use struct path in fs_struct * Use struct path in fs_struct. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Jan Blunck <jblunck@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0d71bd59 |
|
06-Feb-2008 |
Nick Piggin <npiggin@suse.de> |
inotify: remove debug code The inotify debugging code is supposed to verify that the DCACHE_INOTIFY_PARENT_WATCHED scalability optimisation does not result in notifications getting lost nor extra needless locking generated. Unfortunately there are also some races in the debugging code. And it isn't very good at finding problems anyway. So remove it for now. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Robert Love <rlove@google.com> Cc: John McCutchan <ttb@tentacle.dhs.org> Cc: Jan Kara <jack@ucw.cz> Cc: Yan Zheng <yanzheng@21cn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e8462caa |
|
06-Feb-2008 |
Akinobu Mita <akinobu.mita@gmail.com> |
fs: use hlist_unhashed Use hlist_unhashed() instead of opencoded equivalent. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
321bcf92 |
|
21-Oct-2007 |
J. Bruce Fields <bfields@citi.umich.edu> |
dcache: don't expose uninitialized memory in /proc/<pid>/fd/<fd> Well, it's not especially important that target->d_iname get the contents of dentry->d_iname, but it's important that it get initialized with *something*, otherwise we're just exposing some random piece of memory to anyone who reads the link at /proc/<pid>/fd/<fd> for the deleted file, when it's still held open by someone. I've run a test program that copies a short (<36 character) name ontop of a long (>=36 character) name and see that the first time I run it, without this patch, I get unpredicatable results out of /proc/<pid>/fd/<fd>. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
74c3cbe3 |
|
22-Jul-2007 |
Al Viro <viro@zeniv.linux.org.uk> |
[PATCH] audit: watching subtrees New kind of audit rule predicates: "object is visible in given subtree". The part that can be sanely implemented, that is. Limitations: * if you have hardlink from outside of tree, you'd better watch it too (or just watch the object itself, obviously) * if you mount something under a watched tree, tell audit that new chunk should be added to watched subtrees * if you umount something in a watched tree and it's still mounted elsewhere, you will get matches on events happening there. New command tells audit to recalculate the trees, trimming such sources of false positives. Note that it's _not_ about path - if something mounted in several places (multiple mount, bindings, different namespaces, etc.), the match does _not_ depend on which one we are using for access. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
f77e3498 |
|
17-Oct-2007 |
Denis Cheng <crquan@gmail.com> |
vfs: use the predefined d_unhashed inline function instead Signed-off-by: Denis Cheng <crquan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
37c42524 |
|
17-Oct-2007 |
Denis V. Lunev <den@openvz.org> |
shrink_dcache_sb speedup This patch makes shrink_dcache_sb consistent with dentry pruning policy. On the first pass we iterate over dentry unused list and prepare some dentries for removal. However, since the existing code moves evicted dentries to the beginning of the LRU it can happen that fresh dentries from other superblocks will be inserted *before* our dentries. This can result in significant slowdown of shrink_dcache_sb(). Moreover, for virtual filesystems like unionfs which can call dput() during dentries kill existing code results in O(n^2) complexity. We observed 2 minutes shrink_dcache_sb() with only 35000 dentries. To avoid this effects we propose to isolate sb dentries at the end of LRU list. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Kirill Korotaev <dev@openvz.org> Signed-off-by: Andrey Mirkin <amirkin@openvz.org> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bc154b1e |
|
17-Oct-2007 |
J. Bruce Fields <bfields@fieldses.org> |
dcache: trivial comment fix As it stands this comment is confusing, and not quite grammatical. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
85864e10 |
|
17-Oct-2007 |
Miklos Szeredi <mszeredi@suse.cz> |
clean out unused code in dentry pruning It looks like in the end all pruners want parents removed. So remove unused code and function arguments. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
74bf17cf |
|
17-Oct-2007 |
Denis Cheng <crquan@gmail.com> |
fs: remove the unused mempages parameter Since the mempages parameter is actually not used, they should be removed. Now there is only files_init use the mempages parameter, files_init(mempages); but I don't think the adaptation to mempages in files_init is really useful; and if files_init also changed to the prototype void (*func)(void), the wrapper vfs_caches_init would also not need the mempages parameter. Signed-off-by: Denis Cheng <crquan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e12ba74d |
|
16-Oct-2007 |
Mel Gorman <mel@csn.ul.ie> |
Group short-lived and reclaimable kernel allocations This patch marks a number of allocations that are either short-lived such as network buffers or are reclaimable such as inode allocations. When something like updatedb is called, long-lived and unmovable kernel allocations tend to be spread throughout the address space which increases fragmentation. This patch groups these allocations together as much as possible by adding a new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be reclaimed on demand, but not moved. i.e. they can be migrated by deleting them and re-reading the information from elsewhere. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
20c2df83 |
|
19-Jul-2007 |
Paul Mundt <lethal@linux-sh.org> |
mm: Remove slab destructors from kmem_cache_create(). Slab destructors were no longer supported after Christoph's c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been BUGs for both slab and slub, and slob never supported them either. This rips out support for the dtor pointer from kmem_cache_create() completely and fixes up every single callsite in the kernel (there were about 224, not including the slab allocator definitions themselves, or the documentation references). Signed-off-by: Paul Mundt <lethal@linux-sh.org>
|
#
8e1f936b |
|
17-Jul-2007 |
Rusty Russell <rusty@rustcorp.com.au> |
mm: clean up and kernelify shrinker registration I can never remember what the function to register to receive VM pressure is called. I have to trace down from __alloc_pages() to find it. It's called "set_shrinker()", and it needs Your Help. 1) Don't hide struct shrinker. It contains no magic. 2) Don't allocate "struct shrinker". It's not helpful. 3) Call them "register_shrinker" and "unregister_shrinker". 4) Call the function "shrink" not "shrinker". 5) Reduce the 17 lines of waffly comments to 13, but document it properly. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: David Chinner <dgc@sgi.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e63340ae |
|
08-May-2007 |
Randy Dunlap <randy.dunlap@oracle.com> |
header cleaning: don't include smp_lock.h when not used Remove includes of <linux/smp_lock.h> where it is not used/needed. Suggested by Al Viro. Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc, sparc64, and arm (all 59 defconfigs). Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c23fbb6b |
|
08-May-2007 |
Eric Dumazet <dada1@cosmosbay.com> |
VFS: delay the dentry name generation on sockets and pipes 1) Introduces a new method in 'struct dentry_operations'. This method called d_dname() might be called from d_path() to build a pathname for special filesystems. It is called without locks. Future patches (if we succeed in having one common dentry for all pipes/sockets) may need to change prototype of this method, but we now use : char *d_dname(struct dentry *dentry, char *buffer, int buflen); 2) Adds a dynamic_dname() helper function that eases d_dname() implementations 3) Defines d_dname method for sockets : No more sprintf() at socket creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... 4) Defines d_dname method for pipes : No more sprintf() at pipe creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a *nice* speedup on my Pentium(M) 1.6 Ghz : 3.090 s instead of 3.450 s Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Christoph Hellwig <hch@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
24c32d73 |
|
08-May-2007 |
Andrew Morton <akpm@linux-foundation.org> |
mm: shrink parent dentries when shrinking slab Teach the dentry slab shrinker to aggressively shrink parent dentries when shrinking the dentry cache. This is done to attempt to improve the situation where the dentry slab cache gets a lot of internal fragmentation due to pages containing directory dentries. It is expected that this change will cause some of those dentries to be reaped earlier, and with less scanning. Needs careful testing. Cc: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d52b9086 |
|
08-May-2007 |
Miklos Szeredi <mszeredi@suse.cz> |
fix quadratic behavior of shrink_dcache_parent() The time shrink_dcache_parent() takes, grows quadratically with the depth of the tree under 'parent'. This starts to get noticable at about 10,000. These kinds of depths don't occur normally, and filesystems which invoke shrink_dcache_parent() via d_invalidate() seem to have other depth dependent timings, so it's not even easy to expose this problem. However with FUSE it's easy to create a deep tree and d_invalidate() will also get called. This can make a syscall hang for a very long time. This is the original discovery of the problem by Russ Cox: http://article.gmane.org/gmane.comp.file-systems.fuse.devel/3826 The following patch fixes the quadratic behavior, by optionally allowing prune_dcache() to prune ancestors of a dentry in one go, instead of doing it one at a time. Common code in dput() and prune_one_dentry() is extracted into a new helper function d_kill(). shrink_dcache_parent() as well as shrink_dcache_sb() are converted to use the ancestry-pruner option. Only for shrink_dcache_memory() is this behavior not desirable, so it keeps using the old algorithm. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Maneesh Soni <maneesh@in.ibm.com> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Neil Brown <neilb@suse.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0a31bd5f |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
KMEM_CACHE(): simplify slab cache creation This patch provides a new macro KMEM_CACHE(<struct>, <flags>) to simplify slab creation. KMEM_CACHE creates a slab with the name of the struct, with the size of the struct and with the alignment of the struct. Additional slab flags may be specified if necessary. Example struct test_slab { int a,b,c; struct list_head; } __cacheline_aligned_in_smp; test_slab_cache = KMEM_CACHE(test_slab, SLAB_PANIC) will create a new slab named "test_slab" of the size sizeof(struct test_slab) and aligned to the alignment of test slab. If it fails then we panic. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
552ce544 |
|
13-Feb-2007 |
Linus Torvalds <torvalds@woody.linux-foundation.org> |
Revert "[PATCH] Fix d_path for lazy unmounts" This reverts commit eb3dfb0cb1f4a44e2d0553f89514ce9f2a9fcaf1. It causes some strange Gnome problem with dbus-daemon getting stuck, so we'll revert it until that problem is understood. Reported by both walt and Greg KH, who both independently git-bisected the problem to this commit. Andreas is looking at it. Reported-by: walt <wa1ter@myrealbox.com> Reported-by: Greg KH <greg@kroah.com> Acked-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
eb3dfb0c |
|
12-Feb-2007 |
Andreas Gruenbacher <agruen@suse.de> |
[PATCH] Fix d_path for lazy unmounts Here is a bugfix to d_path. First, when d_path() hits a lazily unmounted mount point, it tries to prepend the name of the lazily unmounted dentry to the path name. It gets this wrong, and also overwrites the slash that separates the name from the following pathname component. This is demonstrated by the attached test case, which prints "getcwd returned d_path-bugsubdir" with the bug. The correct result would be "getcwd returned d_path-bug/subdir". It could be argued that the name of the root dentry should not be part of the result of d_path in the first place. On the other hand, what the unconnected namespace was once reachable as may provide some useful hints to users, and so that seems okay. Second, it isn't always possible to tell from the __d_path result whether the specified root and rootmnt (i.e., the chroot) was reached: lazy unmounts of bind mounts will produce a path that does start with a non-slash so we can tell from that, but other lazy unmounts will produce a path that starts with a slash, just like "ordinary" paths. The attached patch cleans up __d_path() to fix the bug with overlapping pathname components. It also adds a @fail_deleted argument, which allows to get rid of some of the mess in sys_getcwd(). Grabbing the dcache_lock can then also be moved into __d_path(). The patch also makes sure that paths will only start with a slash for paths which are connected to the root and rootmnt. The @fail_deleted argument could be added to d_path() as well: this would allow callers to recognize deleted files, without having to resort to the ambiguous check for the " (deleted)" string at the end of the pathnames. This is not currently done, but it might be worthwhile. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Cc: Neil Brown <neilb@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b3423415 |
|
06-Dec-2006 |
Eric Dumazet <dada1@cosmosbay.com> |
[PATCH] dcache: avoid RCU for never-hashed dentries Some dentries don't need to be globally visible in dentry hashtable. (pipes & sockets) Such dentries dont need to wait for a RCU grace period at delete time. Being able to free them permits a better CPU cache use (hot cache) This patch combined with (dont insert pipe dentries into dentry_hashtable) reduced time of { pipe(p); close(p[0]); close(p[1]);} on my UP machine (1.6 GHz Pentium-M) from 3.23 us to 2.86 us (But this patch does not depend on other patches, only bench results) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Maneesh Soni <maneesh@in.ibm.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Acked-by: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
e18b890b |
|
06-Dec-2006 |
Christoph Lameter <clameter@sgi.com> |
[PATCH] slab: remove kmem_cache_t Replace all uses of kmem_cache_t with struct kmem_cache. The patch was generated using the following script: #!/bin/sh # # Replace one string by another in all the kernel sources. # set -e for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do quilt add $file sed -e "1,\$s/$1/$2/g" $file >/tmp/$$ mv /tmp/$$ $file quilt refresh done The script was run like this sh replace kmem_cache_t "struct kmem_cache" Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
f8713576 |
|
28-Oct-2006 |
David Howells <dhowells@redhat.com> |
[PATCH] VFS: Fix an error in unused dentry counting With Vasily Averin <vvs@sw.ru> Fix an error in unused dentry counting in shrink_dcache_for_umount_subtree() in which the count is modified without the dcache_lock held. Signed-off-by: David Howells <dhowells@redhat.com> Cc: Vasily Averin <vvs@sw.ru> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
6eac3f93 |
|
28-Oct-2006 |
Vasily Averin <vvs@sw.ru> |
[PATCH] missing unused dentry in prune_dcache()? On the the following patch: http://linux.bkbits.net:8080/linux-2.6/gnupatch@449b144ecSF1rYskg3q-SeR2vf88zg # ChangeSet # 2006/06/22 15:05:57-07:00 neilb@suse.de # [PATCH] Fix dcache race during umount # If prune_dcache finds a dentry that it cannot free, it leaves it where it # is (at the tail of the list) and exits, on the assumption that some other # thread will be removing that dentry soon. However as far as I see this comment is not correct: when we cannot take s_umount rw_semaphore (for example because it was taken in do_remount) this dentry is already extracted from dentry_unused list and we do not add it into the list again. Therefore dentry will not be found by prune_dcache() and shrink_dcache_sb() and will leave in memory very long time until the partition will be unmounted. The patch adds this dentry into tail of the dentry_unused list. Signed-off-by: Vasily Averin <vvs@sw.ru> Cc: Neil Brown <neilb@suse.de> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
9eaef27b |
|
21-Oct-2006 |
Trond Myklebust <Trond.Myklebust@netapp.com> |
[PATCH] VFS: Make d_materialise_unique() enforce directory uniqueness If the caller tries to instantiate a directory using an inode that already has a dentry alias, then we attempt to rename the existing dentry instead of instantiating a new one. Fail with an ELOOP error if the rename would affect one of our parent directories. This behaviour is needed in order to avoid issues such as http://bugzilla.kernel.org/show_bug.cgi?id=7178 Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Maneesh Soni <maneesh@in.ibm.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Neil Brown <neilb@cse.unsw.edu.au> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
c636ebdb |
|
11-Oct-2006 |
David Howells <dhowells@redhat.com> |
[PATCH] VFS: Destroy the dentries contributed by a superblock on unmounting The attached patch destroys all the dentries attached to a superblock in one go by: (1) Destroying the tree rooted at s_root. (2) Destroying every entry in the anon list, one at a time. (3) Each entry in the anon list has its subtree consumed from the leaves inwards. This reduces the amount of work generic_shutdown_super() does, and avoids iterating through the dentry_unused list. Note that locking is almost entirely absent in the shrink_dcache_for_umount*() functions added by this patch. This is because: (1) at the point the filesystem calls generic_shutdown_super(), it is not permitted to further touch the superblock's set of dentries, and nor may it remove aliases from inodes; (2) the dcache memory shrinker now skips dentries that are being unmounted; and (3) the superblock no longer has any external references through which the VFS can reach it. Given these points, the only locking we need to do is when we remove dentries from the unused list and the name hashes, which we do a directory's worth at a time. We also don't need to guard against reference counts going to zero unexpectedly and removing bits of the tree we're working on as nothing else can call dput(). A cut down version of dentry_iput() has been folded into shrink_dcache_for_umount_subtree() function. Apart from not needing to unlock things, it also doesn't need to check for inotify watches. In this version of the patch, the complaint about a dentry still being in use has been expanded from a single BUG_ON() and now gives much more information. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: NeilBrown <neilb@suse.de> Acked-by: Ian Kent <raven@themaw.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
21c0d8fd |
|
04-Oct-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] knfsd: close a race-opportunity in d_splice_alias There is a possible race in d_splice_alias. Though __d_find_alias(inode, 1) will only return a dentry with DCACHE_DISCONNECTED set, it is possible for it to get cleared before the BUG_ON, and it is is not possible to lock against that. There are a couple of problems here. Firstly, the code doesn't match the comment. The comment describes a 'disconnected' dentry as being IS_ROOT as well as DCACHE_DISCONNECTED, however there is not testing of IS_ROOT anythere. A dentry is marked DCACHE_DISCONNECTED when allocated with d_alloc_anon, and remains DCACHE_DISCONNECTED while a path is built up towards the root. So a dentry can have a valid name and a valid parent and even grandparent, but will still be DCACHE_DISCONNECTED until a path to the root is created. Once the path to the root is complete, everything in the path gets DCACHE_DISCONNECTED cleared. So the fact that DCACHE_DISCONNECTED isn't enough to say that a dentry is free to be spliced in with a given name. This can only be allowed if the dentry does not yet have a name, so the IS_ROOT test is needed too. However even adding that test to __d_find_alias isn't enough. As d_splice_alias drops dcache_lock before calling d_move to perform the splice, it could race with another thread calling d_splice_alias to splice the inode in with a different name in a different part of the tree (in the case where a file has hard links). So that splicing code is only really safe for directories (as we know that directories only have one link). For directories, the caller of d_splice_alias will be holding i_mutex on the (unique) parent so there is no room for a race. A consequence of this is that a non-directory will never benefit from being spliced into a pre-exisiting dentry, but that isn't a problem. It is perfectly OK for a non-directory to have multiple dentries, some anonymous, some not. And the comment for d_splice_alias says that it only happens for directories anyway. Signed-off-by: Neil Brown <neilb@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
07f3f05c |
|
30-Sep-2006 |
David Howells <dhowells@redhat.com> |
[PATCH] BLOCK: Move extern declarations out of fs/*.c into header files [try #6] Create a new header file, fs/internal.h, for common definitions local to the sources in the fs/ directory. Move extern definitions that should be in header files from fs/*.c to fs/internal.h or other main header files where they span directories. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
770bfad8 |
|
22-Aug-2006 |
David Howells <dhowells@redhat.com> |
NFS: Add dentry materialisation op The attached patch adds a new directory cache management function that prepares a disconnected anonymous function to be connected into the dentry tree. The anonymous dentry is transferred the name and parentage from another dentry. The following changes were made in [try #2]: (*) d_materialise_dentry() now switches the parentage of the two nodes around correctly when one or other of them is self-referential. The following changes were made in [try #7]: (*) d_instantiate_unique() has had the interior part split out as function __d_instantiate_unique(). Callers of this latter function must be holding the appropriate locks. (*) _d_rehash() has been added as a wrapper around __d_rehash() to call it with the most obvious hash list (the one from the name). d_rehash() now calls _d_rehash(). (*) d_materialise_dentry() is now __d_materialise_dentry() and is static. (*) d_materialise_unique() added to perform the combination of d_find_alias(), d_materialise_dentry() and d_add_unique() that the NFS client was doing twice, all within a single dcache_lock critical section. This reduces the number of times two different spinlocks were being accessed. The following further changes were made: (*) Add the dentries onto their parents d_subdirs lists. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
#
a90b9c05 |
|
03-Jul-2006 |
Ingo Molnar <mingo@elte.hu> |
[PATCH] lockdep: annotate dcache Teach special (recursive) locking code to the lock validator. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
e4d91918 |
|
03-Jul-2006 |
Ingo Molnar <mingo@elte.hu> |
[PATCH] lockdep: locking init debugging improvement Locking init improvement: - introduce and use __SPIN_LOCK_UNLOCKED for array initializations, to pass in the name string of locks, used by debugging Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
6ab3d562 |
|
30-Jun-2006 |
Jörn Engel <joern@wohnheim.fh-wedel.de> |
Remove obsolete #include <linux/config.h> Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
#
1bfba4e8 |
|
26-Jun-2006 |
Akinobu Mita <mita@miraclelinux.com> |
[PATCH] core: use list_move() This patch converts the combination of list_del(A) and list_add(A, B) to list_move(A, B). Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Akinobu Mita <mita@miraclelinux.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
8e13059a |
|
26-Jun-2006 |
Akinobu Mita <mita@miraclelinux.com> |
[PATCH] use list_add_tail() instead of list_add() This patch converts list_add(A, B.prev) to list_add_tail(A, &B) for readability. Acked-by: Karsten Keil <kkeil@suse.de> Cc: Jan Harkes <jaharkes@cs.cmu.edu> Acked-by: Jan Kara <jack@suse.cz> AOLed-by: David Woodhouse <dwmw2@infradead.org> Cc: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: Akinobu Mita <mita@miraclelinux.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
f58a1ebb |
|
25-Jun-2006 |
Hua Zhong <hzhong@gmail.com> |
[PATCH] remove unlikely(sb) in prune_dcache likely profiling shows that the following is a miss. After boot: [+- ] Type | # True | # False | Function:Filename@Line +unlikely | 1074| 0 prune_dcache()@:fs/dcache.c@409 After a bonnie++ run: +unlikely | 66716| 19584 prune_dcache()@:fs/dcache.c@409 So remove it. Signed-off-by: Hua Zhong <hzhong@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
454e2398 |
|
23-Jun-2006 |
David Howells <dhowells@redhat.com> |
[PATCH] VFS: Permit filesystem to override root dentry on mount Extend the get_sb() filesystem operation to take an extra argument that permits the VFS to pass in the target vfsmount that defines the mountpoint. The filesystem is then required to manually set the superblock and root dentry pointers. For most filesystems, this should be done with simple_set_mnt() which will set the superblock pointer and then set the root dentry to the superblock's s_root (as per the old default behaviour). The get_sb() op now returns an integer as there's now no need to return the superblock pointer. This patch permits a superblock to be implicitly shared amongst several mount points, such as can be done with NFS to avoid potential inode aliasing. In such a case, simple_set_mnt() would not be called, and instead the mnt_root and mnt_sb would be set directly. The patch also makes the following changes: (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount pointer argument and return an integer, so most filesystems have to change very little. (*) If one of the convenience function is not used, then get_sb() should normally call simple_set_mnt() to instantiate the vfsmount. This will always return 0, and so can be tail-called from get_sb(). (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the dcache upon superblock destruction rather than shrink_dcache_anon(). This is required because the superblock may now have multiple trees that aren't actually bound to s_root, but that still need to be cleaned up. The currently called functions assume that the whole tree is rooted at s_root, and that anonymous dentries are not the roots of trees which results in dentries being left unculled. However, with the way NFS superblock sharing are currently set to be implemented, these assumptions are violated: the root of the filesystem is simply a dummy dentry and inode (the real inode for '/' may well be inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries with child trees. [*] Anonymous until discovered from another tree. (*) The documentation has been adjusted, including the additional bit of changing ext2_* into foo_* in the documentation. [akpm@osdl.org: convert ipath_fs, do other stuff] Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Nathan Scott <nathans@sgi.com> Cc: Roland Dreier <rolandd@cisco.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
d702ccb3 |
|
22-Jun-2006 |
Andrew Morton <akpm@osdl.org> |
[PATCH] prune_one_dentry() tweaks - Add description of d_lock handling to comments over prune_one_dentry(). - It has three callsites - uninline it, saving 200 bytes of text. Cc: Jan Blunck <jblunck@suse.de> Cc: Kirill Korotaev <dev@openvz.org> Cc: Olaf Hering <olh@suse.de> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
0feae5c4 |
|
22-Jun-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] Fix dcache race during umount The race is that the shrink_dcache_memory shrinker could get called while a filesystem is being unmounted, and could try to prune a dentry belonging to that filesystem. If it does, then it will call in to iput on the inode while the dentry is no longer able to be found by the umounting process. If iput takes a while, generic_shutdown_super could get all the way though shrink_dcache_parent and shrink_dcache_anon and invalidate_inodes without ever waiting on this particular inode. Eventually the superblock gets freed anyway and if the iput tried to touch it (which some filesystems certainly do), it will lose. The promised "Self-destruct in 5 seconds" doesn't lead to a nice day. The race is closed by holding s_umount while calling prune_one_dentry on someone else's dentry. As a down_read_trylock is used, shrink_dcache_memory will no longer try to prune the dentry of a filesystem that is being unmounted, and unmount will not be able to start until any such active prune_one_dentry completes. This requires that prune_dcache *knows* which filesystem (if any) it is doing the prune on behalf of so that it can be careful of other filesystems. shrink_dcache_memory isn't called it on behalf of any filesystem, and so is careful of everything. shrink_dcache_anon is now passed a super_block rather than the s_anon list out of the superblock, so it can get the s_anon list itself, and can pass the superblock down to prune_dcache. If prune_dcache finds a dentry that it cannot free, it leaves it where it is (at the tail of the list) and exits, on the assumption that some other thread will be removing that dentry soon. To try to make sure that some work gets done, a limited number of dnetries which are untouchable are skipped over while choosing the dentry to work on. I believe this race was first found by Kirill Korotaev. Cc: Jan Blunck <jblunck@suse.de> Acked-by: Kirill Korotaev <dev@openvz.org> Cc: Olaf Hering <olh@suse.de> Acked-by: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Balbir Singh <balbir@in.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
3e7e241f |
|
31-Mar-2006 |
Eric W. Biederman <ebiederm@xmission.com> |
[PATCH] dcache: Add helper d_hash_and_lookup It is very common to hash a dentry and then to call lookup. If we take fs specific hash functions into account the full hash logic can get ugly. Further full_name_hash as an inline function is almost 100 bytes on x86 so having a non-inline choice in some cases can measurably decrease code size. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
7a2bd3f7 |
|
31-Mar-2006 |
Amy Griffis <amy.griffis@hp.com> |
[PATCH] inotify: IN_DELETE events missing IN_DELETE events are no longer generated for the removal of a file from a watched directory. This seems to be a result of clearing DCACHE_INOTIFY_PARENT_WATCHED in d_delete() directly before calling fsnotify_nameremove(). Assuming the flag doesn't need to be cleared before dentry_iput(), this should do the trick. Signed-off-by: Amy Griffis <amy.griffis@hp.com> Cc: John McCutchan <ttb@tentacle.dhs.org> Acked-by: Robert Love <rml@novell.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
3be7d29f |
|
26-Mar-2006 |
Artem B. Bityuckiy <dedekind@infradead.org> |
Remove ugly debugging stuff Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
#
fa3536cc |
|
26-Mar-2006 |
Eric Dumazet <dada1@cosmosbay.com> |
[PATCH] Use __read_mostly on some hot fs variables I discovered on oprofile hunting on a SMP platform that dentry lookups were slowed down because d_hash_mask, d_hash_shift and dentry_hashtable were in a cache line that contained inodes_stat. So each time inodes_stats is changed by a cpu, other cpus have to refill their cache line. This patch moves some variables to the __read_mostly section, in order to avoid false sharing. RCU dentry lookups can go full speed. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
28133c7b |
|
26-Mar-2006 |
Eric Sesterhenn <snakebyte@gmx.de> |
BUG_ON() Conversion in fs/dcache.c this changes if() BUG(); constructs to BUG_ON() which is cleaner and can better optimized away Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
#
2ab13460 |
|
25-Mar-2006 |
Kirill Korotaev <dev@openvz.org> |
[PATCH] Reduce sched latency in shrink_dcache_sb() This patch reduces scheduling latency in shrink_dcache_sb() noticed during remounting of big partitions with many cached dentries. The same latency fix was applied to select_parent() long ago. Signed-off-by: Denis Lunev <den@sw.ru> Signed-off-by: Pavel Emelianov <xemul@sw.ru> Signed-off-by: Kirill Korotaev <dev@openvz.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
c32ccd87 |
|
25-Mar-2006 |
Nick Piggin <nickpiggin@yahoo.com.au> |
[PATCH] inotify: lock avoidance with parent watch status in dentry Previous inotify work avoidance is good when inotify is completely unused, but it breaks down if even a single watch is in place anywhere in the system. Robin Holt notices that udev is one such culprit - it slows down a 512-thread application on a 512 CPU system from 6 seconds to 22 minutes. Solve this by adding a flag in the dentry that tells inotify whether or not its parent inode has a watch on it. Event queueing to parent will skip taking locks if this flag is cleared. Setting and clearing of this flag on all child dentries versus event delivery: this is no in terms of race cases, and that was shown to be equivalent to always performing the check. The essential behaviour is that activity occuring _after_ a watch has been added and _before_ it has been removed, will generate events. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Robert Love <rml@novell.com> Cc: John McCutchan <ttb@tentacle.dhs.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
214fda1f |
|
25-Mar-2006 |
David Howells <dhowells@redhat.com> |
[PATCH] Optimise d_find_alias() The attached patch optimises d_find_alias() to only take the spinlock if there's anything in the the inode's alias list. If there isn't, it returns NULL immediately. With respect to the superblock sharing patch, this should reduce by one the number of times the dcache_lock is taken by nfs_lookup() for ordinary directory lookups. Only in the case where there's already a dentry for particular directory inode (such as might happen when another mountpoint is rooted at that dentry) will the lock then be taken the extra time. Signed-off-by: David Howells <dhowells@redhat.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
b0196009 |
|
24-Mar-2006 |
Paul Jackson <pj@sgi.com> |
[PATCH] cpuset memory spread slab cache hooks Change the kmem_cache_create calls for certain slab caches to support cpuset memory spreading. See the previous patches, cpuset_mem_spread, for an explanation of cpuset memory spreading, and cpuset_mem_spread_slab_cache for the slab cache support for memory spreading. The slab caches marked for now are: dentry_cache, inode_cache, some xfs slab caches, and buffer_head. This list may change over time. In particular, other file system types that are used extensively on large NUMA systems may want to allow for spreading their directory and inode slab cache entries. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
529bf6be |
|
07-Mar-2006 |
Dipankar Sarma <dipankar@in.ibm.com> |
[PATCH] fix file counting I have benchmarked this on an x86_64 NUMA system and see no significant performance difference on kernbench. Tested on both x86_64 and powerpc. The way we do file struct accounting is not very suitable for batched freeing. For scalability reasons, file accounting was constructor/destructor based. This meant that nr_files was decremented only when the object was removed from the slab cache. This is susceptible to slab fragmentation. With RCU based file structure, consequent batched freeing and a test program like Serge's, we just speed this up and end up with a very fragmented slab - llm22:~ # cat /proc/sys/fs/file-nr 587730 0 758844 At the same time, I see only a 2000+ objects in filp cache. The following patch I fixes this problem. This patch changes the file counting by removing the filp_count_lock. Instead we use a separate percpu counter, nr_files, for now and all accesses to it are through get_nr_files() api. In the sysctl handler for nr_files, we populate files_stat.nr_files before returning to user. Counting files as an when they are created and destroyed (as opposed to inside slab) allows us to correctly count open files with RCU. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
47ba87e0 |
|
03-Feb-2006 |
Marcelo Tosatti <marcelo.tosatti@cyclades.com> |
[PATCH] make "struct d_cookie" depend on CONFIG_PROFILING Shrinks "struct dentry" from 128 bytes to 124 on x86, allowing 31 objects per slab instead of 30. Cc: John Levon <levon@movementarian.org> Cc: Philippe Elie <phil.el@wanadoo.fr> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
858119e1 |
|
14-Jan-2006 |
Arjan van de Ven <arjan@infradead.org> |
[PATCH] Unlinline a bunch of other functions Remove the "inline" keyword from a bunch of big functions in the kernel with the goal of shrinking it by 30kb to 40kb Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jeff Garzik <jgarzik@pobox.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
e866cfa9 |
|
09-Jan-2006 |
Oleg Drokin <green@linuxhacker.ru> |
[PATCH] d_instantiate_unique / NFS inode leakage If we have found aliased dentry that we return, inode reference is not dropped and inode is not attached anywhere, so it seems the reference to inode is leaked in that case. Cc: Trond Myklebust <trond.myklebust@fys.uio.no>, Cc: <viro@parcelfarce.linux.theplanet.co.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
5160ee6f |
|
08-Jan-2006 |
Eric Dumazet <dada1@cosmosbay.com> |
[PATCH] shrink dentry struct Some long time ago, dentry struct was carefully tuned so that on 32 bits UP, sizeof(struct dentry) was exactly 128, ie a power of 2, and a multiple of memory cache lines. Then RCU was added and dentry struct enlarged by two pointers, with nice results for SMP, but not so good on UP, because breaking the above tuning (128 + 8 = 136 bytes) This patch reverts this unwanted side effect, by using an union (d_u), where d_rcu and d_child are placed so that these two fields can share their memory needs. At the time d_free() is called (and d_rcu is really used), d_child is known to be empty and not touched by the dentry freeing. Lockless lookups only access d_name, d_parent, d_lock, d_op, d_flags (so the previous content of d_child is not needed if said dentry was unhashed but still accessed by a CPU because of RCU constraints) As dentry cache easily contains millions of entries, a size reduction is worth the extra complexity of the ugly C union. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Maneesh Soni <maneesh@in.ibm.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Ian Kent <raven@themaw.net> Cc: Paul Jackson <pj@sgi.com> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Neil Brown <neilb@cse.unsw.edu.au> Cc: James Morris <jmorris@namei.org> Cc: Stephen Smalley <sds@epoch.ncsc.mil> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
665a7583 |
|
07-Nov-2005 |
Paul E. McKenney <paulmck@kernel.org> |
[PATCH] Remove hlist_for_each_rcu() API, convert existing use to hlist_for_each_entry_rcu Remove the hlist_for_each_rcu() API, which is used only in one place, and is trivially converted to hlist_for_each_entry_rcu(), making the code shorter and more readable. Any out-of-tree uses may be similarly converted. Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
27496a8c |
|
21-Oct-2005 |
Al Viro <viro@zeniv.linux.org.uk> |
[PATCH] gfp_t: fs/* - ->releasepage() annotated (s/int/gfp_t), instances updated - missing gfp_t in fs/* added - fixed misannotation from the original sweep caught by bitwise checks: XFS used __nocast both for gfp_t and for flags used by XFS allocator. The latter left with unsigned int __nocast; we might want to add a different type for those but for now let's leave them alone. That, BTW, is a case when __nocast use had been actively confusing - it had been used in the same code for two different and similar types, with no way to catch misuses. Switch of gfp_t to bitwise had caught that immediately... One tricky bit is left alone to be dealt with later - mapping->flags is a mix of gfp_t and error indications. Left alone for now. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
f805fbda |
|
19-Sep-2005 |
Linus Torvalds <torvalds@g5.osdl.org> |
Make fsnotify possibly work better for the inode removal case Checking i_nlink is dubious, but the alternatives look even less appetizing. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
0cdca3f9 |
|
10-Sep-2005 |
Domen Puncer <domen@coderock.org> |
[PATCH] janitor: fs/dcache.c: list_for_each* First one is list_for_each_entry (thanks maks), second 2 list_for_each_safe. Signed-off-by: Maximilian Attems <janitor@sternwelten.at> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
7a91bf7f |
|
08-Aug-2005 |
John McCutchan <ttb@tentacle.dhs.org> |
[PATCH] fsnotify_name/inoderemove The patch below unhooks fsnotify from vfs_unlink & vfs_rmdir. It introduces two new fsnotify calls, that are hooked in at the dcache level. This not only more closely matches how the VFS layer works, it also avoids the problem with locking and inode lifetimes. The two functions are - fsnotify_nameremove -- called when a directory entry is going away. It notifies the PARENT of the deletion. This is called from d_delete(). - inoderemove -- called when the files inode itself is going away. It notifies the inode that is being deleted. This is called from dentry_iput(). Signed-off-by: John McCutchan <ttb@tentacle.dhs.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
75c96f85 |
|
05-May-2005 |
Adrian Bunk <bunk@stusta.de> |
[PATCH] make some things static This patch makes some needlessly global identifiers static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Arjan van de Ven <arjanv@infradead.org> Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
1da177e4 |
|
16-Apr-2005 |
Linus Torvalds <torvalds@ppc970.osdl.org> |
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
|