History log of /linux-master/fs/kernfs/dir.c
Revision Date Author Comments
# 4207b556 09-Jan-2024 Tejun Heo <tj@kernel.org>

kernfs: RCU protect kernfs_nodes and avoid kernfs_idr_lock in kernfs_find_and_get_node_by_id()

The BPF helper bpf_cgroup_from_id() calls kernfs_find_and_get_node_by_id()
which acquires kernfs_idr_lock, which is an non-raw non-IRQ-safe lock. This
can lead to deadlocks as bpf_cgroup_from_id() can be called from any BPF
programs including e.g. the ones that attach to functions which are holding
the scheduler rq lock.

Consider the following BPF program:

SEC("fentry/__set_cpus_allowed_ptr_locked")
int BPF_PROG(__set_cpus_allowed_ptr_locked, struct task_struct *p,
struct affinity_context *affn_ctx, struct rq *rq, struct rq_flags *rf)
{
struct cgroup *cgrp = bpf_cgroup_from_id(p->cgroups->dfl_cgrp->kn->id);

if (cgrp) {
bpf_printk("%d[%s] in %s", p->pid, p->comm, cgrp->kn->name);
bpf_cgroup_release(cgrp);
}
return 0;
}

__set_cpus_allowed_ptr_locked() is called with rq lock held and the above
BPF program calls bpf_cgroup_from_id() within leading to the following
lockdep warning:

=====================================================
WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
6.7.0-rc3-work-00053-g07124366a1d7-dirty #147 Not tainted
-----------------------------------------------------
repro/1620 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
ffffffff833b3688 (kernfs_idr_lock){+.+.}-{2:2}, at: kernfs_find_and_get_node_by_id+0x1e/0x70

and this task is already holding:
ffff888237ced698 (&rq->__lock){-.-.}-{2:2}, at: task_rq_lock+0x4e/0xf0
which would create a new lock dependency:
(&rq->__lock){-.-.}-{2:2} -> (kernfs_idr_lock){+.+.}-{2:2}
...
Possible interrupt unsafe locking scenario:

CPU0 CPU1
---- ----
lock(kernfs_idr_lock);
local_irq_disable();
lock(&rq->__lock);
lock(kernfs_idr_lock);
<Interrupt>
lock(&rq->__lock);

*** DEADLOCK ***
...
Call Trace:
dump_stack_lvl+0x55/0x70
dump_stack+0x10/0x20
__lock_acquire+0x781/0x2a40
lock_acquire+0xbf/0x1f0
_raw_spin_lock+0x2f/0x40
kernfs_find_and_get_node_by_id+0x1e/0x70
cgroup_get_from_id+0x21/0x240
bpf_cgroup_from_id+0xe/0x20
bpf_prog_98652316e9337a5a___set_cpus_allowed_ptr_locked+0x96/0x11a
bpf_trampoline_6442545632+0x4f/0x1000
__set_cpus_allowed_ptr_locked+0x5/0x5a0
sched_setaffinity+0x1b3/0x290
__x64_sys_sched_setaffinity+0x4f/0x60
do_syscall_64+0x40/0xe0
entry_SYSCALL_64_after_hwframe+0x46/0x4e

Let's fix it by protecting kernfs_node and kernfs_root with RCU and making
kernfs_find_and_get_node_by_id() acquire rcu_read_lock() instead of
kernfs_idr_lock.

This adds an rcu_head to kernfs_node making it larger by 16 bytes on 64bit.
Combined with the preceding rearrange patch, the net increase is 8 bytes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <andrea.righi@canonical.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/20240109214828.252092-4-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# e3977e06 09-Jan-2024 Tejun Heo <tj@kernel.org>

Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock"

This reverts commit dad3fb67ca1cbef87ce700e83a55835e5921ce8a.

The commit converted kernfs_idr_lock to an IRQ-safe raw_spinlock because it
could be acquired while holding an rq lock through bpf_cgroup_from_id().
However, kernfs_idr_lock is held while doing GPF_NOWAIT allocations which
involves acquiring an non-IRQ-safe and non-raw lock leading to the following
lockdep warning:

=============================
[ BUG: Invalid wait context ]
6.7.0-rc5-kzm9g-00251-g655022a45b1c #578 Not tainted
-----------------------------
swapper/0/0 is trying to lock:
dfbcd488 (&c->lock){....}-{3:3}, at: local_lock_acquire+0x0/0xa4
other info that might help us debug this:
context-{5:5}
2 locks held by swapper/0/0:
#0: dfbc9c60 (lock){+.+.}-{3:3}, at: local_lock_acquire+0x0/0xa4
#1: c0c012a8 (kernfs_idr_lock){....}-{2:2}, at: __kernfs_new_node.constprop.0+0x68/0x258
stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc5-kzm9g-00251-g655022a45b1c #578
Hardware name: Generic SH73A0 (Flattened Device Tree)
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x68/0x90
dump_stack_lvl from __lock_acquire+0x3cc/0x168c
__lock_acquire from lock_acquire+0x274/0x30c
lock_acquire from local_lock_acquire+0x28/0xa4
local_lock_acquire from ___slab_alloc+0x234/0x8a8
___slab_alloc from __slab_alloc.constprop.0+0x30/0x44
__slab_alloc.constprop.0 from kmem_cache_alloc+0x7c/0x148
kmem_cache_alloc from radix_tree_node_alloc.constprop.0+0x44/0xdc
radix_tree_node_alloc.constprop.0 from idr_get_free+0x110/0x2b8
idr_get_free from idr_alloc_u32+0x9c/0x108
idr_alloc_u32 from idr_alloc_cyclic+0x50/0xb8
idr_alloc_cyclic from __kernfs_new_node.constprop.0+0x88/0x258
__kernfs_new_node.constprop.0 from kernfs_create_root+0xbc/0x154
kernfs_create_root from sysfs_init+0x18/0x5c
sysfs_init from mnt_init+0xc4/0x220
mnt_init from vfs_caches_init+0x6c/0x88
vfs_caches_init from start_kernel+0x474/0x528
start_kernel from 0x0

Let's rever the commit. It's undesirable to spread out raw spinlock usage
anyway and the problem can be solved by protecting the lookup path with RCU
instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <andrea.righi@canonical.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: http://lkml.kernel.org/r/CAMuHMdV=AKt+mwY7svEq5gFPx41LoSQZ_USME5_MEdWQze13ww@mail.gmail.com
Link: https://lore.kernel.org/r/20240109214828.252092-2-tj@kernel.org
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# c312828c 29-Dec-2023 Andrea Righi <andrea.righi@canonical.com>

kernfs: convert kernfs_idr_lock to an irq safe raw spinlock

bpf_cgroup_from_id() is basically a wrapper to cgroup_get_from_id(),
that is relying on kernfs to determine the right cgroup associated to
the target id.

As a kfunc, it has the potential to be attached to any function through
BPF, particularly in contexts where certain locks are held.

However, kernfs is not using an irq safe spinlock for kernfs_idr_lock,
that means any kernfs function that is acquiring this lock can be
interrupted and potentially hit bpf_cgroup_from_id() in the process,
triggering a deadlock.

For example, it is really easy to trigger a lockdep splat between
kernfs_idr_lock and rq->_lock, attaching a small BPF program to
__set_cpus_allowed_ptr_locked() that just calls bpf_cgroup_from_id():

=====================================================
WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
6.7.0-rc7-virtme #5 Not tainted
-----------------------------------------------------
repro/131 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
ffffffffb2dc4578 (kernfs_idr_lock){+.+.}-{2:2}, at: kernfs_find_and_get_node_by_id+0x1d/0x80

and this task is already holding:
ffff911cbecaf218 (&rq->__lock){-.-.}-{2:2}, at: task_rq_lock+0x50/0xc0
which would create a new lock dependency:
(&rq->__lock){-.-.}-{2:2} -> (kernfs_idr_lock){+.+.}-{2:2}

but this new dependency connects a HARDIRQ-irq-safe lock:
(&rq->__lock){-.-.}-{2:2}

... which became HARDIRQ-irq-safe at:
lock_acquire+0xbf/0x2b0
_raw_spin_lock_nested+0x2e/0x40
scheduler_tick+0x5d/0x170
update_process_times+0x9c/0xb0
tick_periodic+0x27/0xe0
tick_handle_periodic+0x24/0x70
__sysvec_apic_timer_interrupt+0x64/0x1a0
sysvec_apic_timer_interrupt+0x6f/0x80
asm_sysvec_apic_timer_interrupt+0x1a/0x20
memcpy+0xc/0x20
arch_dup_task_struct+0x15/0x30
copy_process+0x1ce/0x1eb0
kernel_clone+0xac/0x390
kernel_thread+0x6f/0xa0
kthreadd+0x199/0x230
ret_from_fork+0x31/0x50
ret_from_fork_asm+0x1b/0x30

to a HARDIRQ-irq-unsafe lock:
(kernfs_idr_lock){+.+.}-{2:2}

... which became HARDIRQ-irq-unsafe at:
...
lock_acquire+0xbf/0x2b0
_raw_spin_lock+0x30/0x40
__kernfs_new_node.isra.0+0x83/0x280
kernfs_create_root+0xf6/0x1d0
sysfs_init+0x1b/0x70
mnt_init+0xd9/0x2a0
vfs_caches_init+0xcf/0xe0
start_kernel+0x58a/0x6a0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0xc5/0xe0
secondary_startup_64_no_verify+0x178/0x17b

other info that might help us debug this:

Possible interrupt unsafe locking scenario:

CPU0 CPU1
---- ----
lock(kernfs_idr_lock);
local_irq_disable();
lock(&rq->__lock);
lock(kernfs_idr_lock);
<Interrupt>
lock(&rq->__lock);

*** DEADLOCK ***

Prevent this deadlock condition converting kernfs_idr_lock to a raw irq
safe spinlock.

The performance impact of this change should be negligible and it also
helps to prevent similar deadlock conditions with any other subsystems
that may depend on kernfs.

Fixes: 332ea1f697be ("bpf: Add bpf_cgroup_from_id() kfunc")
Cc: stable <stable@kernel.org>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231229074916.53547-1-andrea.righi@canonical.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ff6d413b 12-Dec-2023 Kees Cook <keescook@chromium.org>

kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy()

One of the last remaining users of strlcpy() in the kernel is
kernfs_path_from_node_locked(), which passes back the problematic "length
we _would_ have copied" return value to indicate truncation. Convert the
chain of all callers to use the negative return value (some of which
already doing this explicitly). All callers were already also checking
for negative return values, so the risk to missed checks looks very low.

In this analysis, it was found that cgroup1_release_agent() actually
didn't handle the "too large" condition, so this is technically also a
bug fix. :)

Here's the chain of callers, and resolution identifying each one as now
handling the correct return value:

kernfs_path_from_node_locked()
kernfs_path_from_node()
pr_cont_kernfs_path()
returns void
kernfs_path()
sysfs_warn_dup()
return value ignored
cgroup_path()
blkg_path()
bfq_bic_update_cgroup()
return value ignored
TRACE_IOCG_PATH()
return value ignored
TRACE_CGROUP_PATH()
return value ignored
perf_event_cgroup()
return value ignored
task_group_path()
return value ignored
damon_sysfs_memcg_path_eq()
return value ignored
get_mm_memcg_path()
return value ignored
lru_gen_seq_show()
return value ignored
cgroup_path_from_kernfs_id()
return value ignored
cgroup_show_path()
already converted "too large" error to negative value
cgroup_path_ns_locked()
cgroup_path_ns()
bpf_iter_cgroup_show_fdinfo()
return value ignored
cgroup1_release_agent()
wasn't checking "too large" error
proc_cgroup_show()
already converted "too large" to negative value

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: <cgroups@vger.kernel.org>
Co-developed-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Link: https://lore.kernel.org/r/20231116192127.1558276-3-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231212211741.164376-3-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 5b56bf5c 12-Dec-2023 Kees Cook <keescook@chromium.org>

kernfs: Convert kernfs_name_locked() from strlcpy() to strscpy()

strlcpy() reads the entire source buffer first. This read may exceed
the destination size limit. This is both inefficient and can lead
to linear read overflows if a source string is not NUL-terminated[1].
Additionally, it returns the size of the source string, not the
resulting size of the destination string. In an effort to remove strlcpy()
completely[2], replace strlcpy() here with strscpy().

Nothing actually checks the return value coming from kernfs_name_locked(),
so this has no impact on error paths. The caller hierarchy is:

kernfs_name_locked()
kernfs_name()
pr_cont_kernfs_name()
return value ignored
cgroup_name()
current_css_set_cg_links_read()
return value ignored
print_page_owner_memcg()
return value ignored

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [1]
Link: https://github.com/KSPP/linux/issues/89 [2]
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Azeem Shaikh <azeemshaikh38@gmail.com>
Link: https://lore.kernel.org/r/20231116192127.1558276-2-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231212211741.164376-2-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 792e0476 12-Dec-2023 Kees Cook <keescook@chromium.org>

kernfs: Convert kernfs_walk_ns() from strlcpy() to strscpy()

strlcpy() reads the entire source buffer first. This read may exceed
the destination size limit. This is both inefficient and can lead
to linear read overflows if a source string is not NUL-terminated[1].
Additionally, it returns the size of the source string, not the
resulting size of the destination string. In an effort to remove strlcpy()
completely[2], replace strlcpy() here with strscpy().

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [1]
Link: https://github.com/KSPP/linux/issues/89 [2]
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Azeem Shaikh <azeemshaikh38@gmail.com>
Link: https://lore.kernel.org/r/20231116192127.1558276-1-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231212211741.164376-1-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 5133bee6 08-Dec-2023 Max Kellermann <max.kellermann@ionos.com>

fs/kernfs/dir: obey S_ISGID

Handling of S_ISGID is usually done by inode_init_owner() in all other
filesystems, but kernfs doesn't use that function. In kernfs, struct
kernfs_node is the primary data structure, and struct inode is only
created from it on demand. Therefore, inode_init_owner() can't be
used and we need to imitate its behavior.

S_ISGID support is useful for the cgroup filesystem; it allows
subtrees managed by an unprivileged process to retain a certain owner
gid, which then enables sharing access to the subtree with another
unprivileged process.

--
v1 -> v2: minor coding style fix (comment)

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231208093310.297233-2-max.kellermann@ionos.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 0559f630 05-Aug-2023 Ian Kent <raven@themaw.net>

kernfs: fix missing kernfs_iattr_rwsem locking

When the kernfs_iattr_rwsem was introduced a case was missed.

The update of the kernfs directory node child count was also protected
by the kernfs_rwsem and needs to be included in the change so that the
child count (and so the inode n_link attribute) does not change while
holding the rwsem for read.

Fixes: 9caf69614225 ("kernfs: Introduce separate rwsem to protect inode attributes.")
Cc: stable <stable@kernel.org>
Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-By: Imran Khan <imran.f.khan@oracle.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Eric Sandeen <sandeen@sandeen.net>
Link: https://lore.kernel.org/r/169128520941.68052.15749253469930138901.stgit@donald.themaw.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 2daf18a7 08-Aug-2023 Hugh Dickins <hughd@google.com>

tmpfs,xattr: enable limited user extended attributes

Enable "user." extended attributes on tmpfs, limiting them by tracking
the space they occupy, and deducting that space from the limited ispace
(unless tmpfs mounted with nr_inodes=0 to leave that ispace unlimited).

tmpfs inodes and simple xattrs are both unswappable, and have to be in
lowmem on a 32-bit highmem kernel: so the ispace limit is appropriate
for xattrs, without any need for a further mount option.

Add simple_xattr_space() to give approximate but deterministic estimate
of the space taken up by each xattr: with simple_xattrs_free() outputting
the space freed if required (but kernfs and even some tmpfs usages do not
require that, so don't waste time on strlen'ing if not needed).

Security and trusted xattrs were already supported: for consistency and
simplicity, account them from the same pool; though there's a small risk
that a tmpfs with enough space before would now be considered too small.

When extended attributes are used, "df -i" does show more IUsed and less
IFree than can be explained by the inodes: document that (manpage later).

xfstests tests/generic which were not run on tmpfs before but now pass:
020 037 062 070 077 097 103 117 337 377 454 486 523 533 611 618 728
with no new failures.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Message-Id: <2e63b26e-df46-5baa-c7d6-f9a8dd3282c5@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 30480b98 22-May-2023 Muchun Song <muchun.song@linux.dev>

kernfs: fix missing kernfs_idr_lock to remove an ID from the IDR

The root->ino_idr is supposed to be protected by kernfs_idr_lock, fix
it.

Fixes: 488dee96bb62 ("kernfs: allow creating kernfs objects with arbitrary uid/gid")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230523024017.24851-1-songmuchun@bytedance.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 06fb4736 09-Mar-2023 Imran Khan <imran.f.khan@oracle.com>

kernfs: change kernfs_rename_lock into a read-write lock.

kernfs_rename_lock protects a node's ->parent and thus kernfs topology.
Thus it can be used in cases that rely on a stable kernfs topology.
Change it to a read-write lock for better scalability.

Suggested by: Al Viro <viro@zeniv.linux.org.uk>

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Link: https://lore.kernel.org/r/20230309110932.2889010-4-imran.f.khan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# c9f2dfb7 09-Mar-2023 Imran Khan <imran.f.khan@oracle.com>

kernfs: Use a per-fs rwsem to protect per-fs list of kernfs_super_info.

Right now per-fs kernfs_rwsem protects list of kernfs_super_info instances
for a kernfs_root. Since kernfs_rwsem is used to synchronize several other
operations across kernfs and since most of these operations don't impact
kernfs_super_info, we can use a separate per-fs rwsem to synchronize access
to list of kernfs_super_info.
This helps in reducing contention around kernfs_rwsem and also allows
operations that change/access list of kernfs_super_info to proceed without
contending for kernfs_rwsem.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Link: https://lore.kernel.org/r/20230309110932.2889010-3-imran.f.khan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 9caf6961 09-Mar-2023 Imran Khan <imran.f.khan@oracle.com>

kernfs: Introduce separate rwsem to protect inode attributes.

Right now a global per-fs rwsem (kernfs_rwsem) synchronizes multiple
kernfs operations. On a large system with few hundred CPUs and few
hundred applications simultaneoulsy trying to access sysfs, this
results in multiple sys_open(s) contending on kernfs_rwsem via
kernfs_iop_permission and kernfs_dop_revalidate.

For example on a system with 384 cores, if I run 200 instances of an
application which is mostly executing the following loop:

for (int loop = 0; loop <100 ; loop++)
{
for (int port_num = 1; port_num < 2; port_num++)
{
for (int gid_index = 0; gid_index < 254; gid_index++ )
{
char ret_buf[64], ret_buf_lo[64];
char gid_file_path[1024];

int ret_len;
int ret_fd;
ssize_t ret_rd;

ub4 i, saved_errno;

memset(ret_buf, 0, sizeof(ret_buf));
memset(gid_file_path, 0, sizeof(gid_file_path));

ret_len = snprintf(gid_file_path, sizeof(gid_file_path),
"/sys/class/infiniband/%s/ports/%d/gids/%d",
dev_name,
port_num,
gid_index);

ret_fd = open(gid_file_path, O_RDONLY | O_CLOEXEC);
if (ret_fd < 0)
{
printf("Failed to open %s\n", gid_file_path);
continue;
}

/* Read the GID */
ret_rd = read(ret_fd, ret_buf, 40);

if (ret_rd == -1)
{
printf("Failed to read from file %s, errno: %u\n",
gid_file_path, saved_errno);

continue;
}

close(ret_fd);
}
}

I see contention around kernfs_rwsem as follows:

path_openat
|
|----link_path_walk.part.0.constprop.0
| |
| |--49.92%--inode_permission
| | |
| | --48.69%--kernfs_iop_permission
| | |
| | |--18.16%--down_read
| | |
| | |--15.38%--up_read
| | |
| | --14.58%--_raw_spin_lock
| | |
| | -----
| |
| |--29.08%--walk_component
| | |
| | --29.02%--lookup_fast
| | |
| | |--24.26%--kernfs_dop_revalidate
| | | |
| | | |--14.97%--down_read
| | | |
| | | --9.01%--up_read
| | |
| | --4.74%--__d_lookup
| | |
| | --4.64%--_raw_spin_lock
| | |
| | ----

Having a separate per-fs rwsem to protect kernfs inode attributes,
will avoid the above mentioned contention and result in better
performance as can bee seen below:

path_openat
|
|----link_path_walk.part.0.constprop.0
| |
| |
| |--27.06%--inode_permission
| | |
| | --25.84%--kernfs_iop_permission
| | |
| | |--9.29%--up_read
| | |
| | |--8.19%--down_read
| | |
| | --7.89%--_raw_spin_lock
| | |
| | ----
| |
| |--22.42%--walk_component
| | |
| | --22.36%--lookup_fast
| | |
| | |--16.07%--__d_lookup
| | | |
| | | --16.01%--_raw_spin_lock
| | | |
| | | ----
| | |
| | --6.28%--kernfs_dop_revalidate
| | |
| | |--3.76%--down_read
| | |
| | --2.26%--up_read

As can be seen from the above data the overhead due to both
kerfs_iop_permission and kernfs_dop_revalidate have gone down and
this also reduces overall run time of the earlier mentioned loop.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Link: https://lore.kernel.org/r/20230309110932.2889010-2-imran.f.khan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 364595a6 30-Mar-2023 Jeff Layton <jlayton@kernel.org>

fs: consolidate duplicate dt_type helpers

There are three copies of the same dt_type helper sprinkled around the
tree. Convert them to use the common fs_umode_to_dtype function instead,
which has the added advantage of properly returning DT_UNKNOWN when
given a mode that contains an unrecognized type.

Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Message-Id: <20230330104144.75547-1-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 95b2a034 26-Nov-2022 Zhen Lei <thunder.leizhen@huawei.com>

kernfs: remove an unused if statement in kernfs_path_from_node_locked()

It makes no sense to call kernfs_path_from_node_locked() with NULL buf,
and no one is doing that right now.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221126111634.1994-1-thunder.leizhen@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# e18275ae 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->rename() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# c54bd91e 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->mkdir() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 24b3e3dd 11-Nov-2022 Randy Dunlap <rdunlap@infradead.org>

kernfs: fix all kernel-doc warnings and multiple typos

Fix kernel-doc warnings. Many of these are about a function's
return value, so use the kernel-doc Return: format to fix those

Use % prefix on numeric constant values.

dir.c: fix typos/spellos
file.c fix typo: s/taret/target/

Fix all of these kernel-doc warnings:

dir.c:305: warning: missing initial short description on line:
* kernfs_name_hash

dir.c:137: warning: No description found for return value of 'kernfs_path_from_node_locked'
dir.c:196: warning: No description found for return value of 'kernfs_name'
dir.c:224: warning: No description found for return value of 'kernfs_path_from_node'
dir.c:292: warning: No description found for return value of 'kernfs_get_parent'
dir.c:312: warning: No description found for return value of 'kernfs_name_hash'
dir.c:404: warning: No description found for return value of 'kernfs_unlink_sibling'
dir.c:588: warning: No description found for return value of 'kernfs_node_from_dentry'
dir.c:806: warning: No description found for return value of 'kernfs_find_ns'
dir.c:879: warning: No description found for return value of 'kernfs_find_and_get_ns'
dir.c:904: warning: No description found for return value of 'kernfs_walk_and_get_ns'
dir.c:927: warning: No description found for return value of 'kernfs_create_root'
dir.c:996: warning: No description found for return value of 'kernfs_root_to_node'
dir.c:1016: warning: No description found for return value of 'kernfs_create_dir_ns'
dir.c:1048: warning: No description found for return value of 'kernfs_create_empty_dir'
dir.c:1306: warning: No description found for return value of 'kernfs_next_descendant_post'
dir.c:1568: warning: No description found for return value of 'kernfs_remove_self'
dir.c:1630: warning: No description found for return value of 'kernfs_remove_by_name_ns'
dir.c:1667: warning: No description found for return value of 'kernfs_rename_ns'

file.c:66: warning: No description found for return value of 'of_on'
file.c:88: warning: No description found for return value of 'kernfs_deref_open_node_locked'
file.c:1036: warning: No description found for return value of '__kernfs_create_file'

inode.c:100: warning: No description found for return value of 'kernfs_setattr'

mount.c:160: warning: No description found for return value of 'kernfs_root_from_sb'
mount.c:198: warning: No description found for return value of 'kernfs_node_dentry'
mount.c:302: warning: No description found for return value of 'kernfs_super_ns'
mount.c:318: warning: No description found for return value of 'kernfs_get_tree'

symlink.c:28: warning: No description found for return value of 'kernfs_create_link'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221112031456.22980-1-rdunlap@infradead.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 92b57842 17-Oct-2022 Ian Kent <raven@themaw.net>

kernfs: dont take i_lock on revalidate

In kernfs_dop_revalidate() when the passed in dentry is negative the
dentry directory is checked to see if it has changed and if so the
negative dentry is discarded so it can refreshed. During this check
the dentry inode i_lock is taken to mitigate against a possible
concurrent rename.

But if it's racing with a rename, becuase the dentry is negative, it
can't be the source it must be the target and it must be going to do
a d_move() otherwise the rename will return an error.

In this case the parent dentry of the target will not change, it will
be the same over the d_move(), only the source dentry parent may change
so the inode i_lock isn't needed.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/166606036967.13363.9336408133975631967.stgit@donald.themaw.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 1edfe4ea 10-Oct-2022 Tejun Heo <tj@kernel.org>

kernfs: Fix spurious lockdep warning in kernfs_find_and_get_node_by_id()

c25491747b21 ("kernfs: Add KERNFS_REMOVING flags") made
kernfs_find_and_get_node_by_id() test kernfs_active() instead of
KERNFS_ACTIVATED. kernfs_find_and_get_by_id() is called without holding the
kernfs_rwsem triggering the following lockdep warning.

WARNING: CPU: 1 PID: 6191 at fs/kernfs/dir.c:36 kernfs_active+0xe8/0x120 fs/kernfs/dir.c:38
Modules linked in:
CPU: 1 PID: 6191 Comm: syz-executor.1 Not tainted 6.0.0-syzkaller-09413-g4899a36f91a9 #0
Hardware name: linux,dummy-virt (DT)
pstate: 10000005 (nzcV daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : kernfs_active+0xe8/0x120 fs/kernfs/dir.c:36
lr : lock_is_held include/linux/lockdep.h:283 [inline]
lr : kernfs_active+0x94/0x120 fs/kernfs/dir.c:36
sp : ffff8000182c7a00
x29: ffff8000182c7a00 x28: 0000000000000002 x27: 0000000000000001
x26: ffff00000ee1f6a8 x25: 1fffe00001dc3ed5 x24: 0000000000000000
x23: ffff80000ca1fba0 x22: ffff8000089efcb0 x21: 0000000000000001
x20: ffff0000091181d0 x19: ffff0000091181d0 x18: ffff00006a9e6b88
x17: 0000000000000000 x16: 0000000000000000 x15: ffff00006a9e6bc4
x14: 1ffff00003058f0e x13: 1fffe0000258c816 x12: ffff700003058f39
x11: 1ffff00003058f38 x10: ffff700003058f38 x9 : dfff800000000000
x8 : ffff80000e482f20 x7 : ffff0000091d8058 x6 : ffff80000e482c60
x5 : ffff000009402ee8 x4 : 1ffff00001bd1f46 x3 : 1fffe0000258c6d1
x2 : 0000000000000003 x1 : 00000000000000c0 x0 : 0000000000000000
Call trace:
kernfs_active+0xe8/0x120 fs/kernfs/dir.c:38
kernfs_find_and_get_node_by_id+0x6c/0x140 fs/kernfs/dir.c:708
__kernfs_fh_to_dentry fs/kernfs/mount.c:102 [inline]
kernfs_fh_to_dentry+0x88/0x1fc fs/kernfs/mount.c:128
exportfs_decode_fh_raw+0x104/0x560 fs/exportfs/expfs.c:435
exportfs_decode_fh+0x10/0x5c fs/exportfs/expfs.c:575
do_handle_to_path fs/fhandle.c:152 [inline]
handle_to_path fs/fhandle.c:207 [inline]
do_handle_open+0x2a4/0x7b0 fs/fhandle.c:223
__do_compat_sys_open_by_handle_at fs/fhandle.c:277 [inline]
__se_compat_sys_open_by_handle_at fs/fhandle.c:274 [inline]
__arm64_compat_sys_open_by_handle_at+0x6c/0x9c fs/fhandle.c:274
__invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
invoke_syscall+0x6c/0x260 arch/arm64/kernel/syscall.c:52
el0_svc_common.constprop.0+0xc4/0x254 arch/arm64/kernel/syscall.c:142
do_el0_svc_compat+0x40/0x70 arch/arm64/kernel/syscall.c:212
el0_svc_compat+0x54/0x140 arch/arm64/kernel/entry-common.c:772
el0t_32_sync_handler+0x90/0x140 arch/arm64/kernel/entry-common.c:782
el0t_32_sync+0x190/0x194 arch/arm64/kernel/entry.S:586
irq event stamp: 232
hardirqs last enabled at (231): [<ffff8000081edf70>] raw_spin_rq_unlock_irq kernel/sched/sched.h:1367 [inline]
hardirqs last enabled at (231): [<ffff8000081edf70>] finish_lock_switch kernel/sched/core.c:4943 [inline]
hardirqs last enabled at (231): [<ffff8000081edf70>] finish_task_switch.isra.0+0x200/0x880 kernel/sched/core.c:5061
hardirqs last disabled at (232): [<ffff80000c888bb4>] el1_dbg+0x24/0x80 arch/arm64/kernel/entry-common.c:404
softirqs last enabled at (228): [<ffff800008010938>] _stext+0x938/0xf58
softirqs last disabled at (207): [<ffff800008019380>] ____do_softirq+0x10/0x20 arch/arm64/kernel/irq.c:79
---[ end trace 0000000000000000 ]---

The lockdep warning in kernfs_active() is there to ensure that the activated
state stays stable for the caller. For kernfs_find_and_get_node_by_id(), all
that's needed is ensuring that a node which has never been activated can't
be looked up and guaranteeing lookup success when the caller knows the node
to be active, both of which can be achieved by testing the active count
without holding the kernfs_rwsem.

Fix the spurious warning by introducing __kernfs_active() which doesn't have
the lockdep annotation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: syzbot+590ce62b128e79cf0a35@syzkaller.appspotmail.com
Fixes: c25491747b21 ("kernfs: Add KERNFS_REMOVING flags")
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/r/Y0SwqBsZ9BMmZv6x@slm.duckdns.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 4abc9965 13-Sep-2022 Christian A. Ehrhardt <lk@c--e.de>

kernfs: fix use-after-free in __kernfs_remove

Syzkaller managed to trigger concurrent calls to
kernfs_remove_by_name_ns() for the same file resulting in
a KASAN detected use-after-free. The race occurs when the root
node is freed during kernfs_drain().

To prevent this acquire an additional reference for the root
of the tree that is removed before calling __kernfs_remove().

Found by syzkaller with the following reproducer (slab_nomerge is
required):

syz_mount_image$ext4(0x0, &(0x7f0000000100)='./file0\x00', 0x100000, 0x0, 0x0, 0x0, 0x0)
r0 = openat(0xffffffffffffff9c, &(0x7f0000000080)='/proc/self/exe\x00', 0x0, 0x0)
close(r0)
pipe2(&(0x7f0000000140)={0xffffffffffffffff, <r1=>0xffffffffffffffff}, 0x800)
mount$9p_fd(0x0, &(0x7f0000000040)='./file0\x00', &(0x7f00000000c0), 0x408, &(0x7f0000000280)={'trans=fd,', {'rfdno', 0x3d, r0}, 0x2c, {'wfdno', 0x3d, r1}, 0x2c, {[{@cache_loose}, {@mmap}, {@loose}, {@loose}, {@mmap}], [{@mask={'mask', 0x3d, '^MAY_EXEC'}}, {@fsmagic={'fsmagic', 0x3d, 0x10001}}, {@dont_hash}]}})

Sample report:

==================================================================
BUG: KASAN: use-after-free in kernfs_type include/linux/kernfs.h:335 [inline]
BUG: KASAN: use-after-free in kernfs_leftmost_descendant fs/kernfs/dir.c:1261 [inline]
BUG: KASAN: use-after-free in __kernfs_remove.part.0+0x843/0x960 fs/kernfs/dir.c:1369
Read of size 2 at addr ffff8880088807f0 by task syz-executor.2/857

CPU: 0 PID: 857 Comm: syz-executor.2 Not tainted 6.0.0-rc3-00363-g7726d4c3e60b #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x6e/0x91 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:317 [inline]
print_report.cold+0x5e/0x5e5 mm/kasan/report.c:433
kasan_report+0xa3/0x130 mm/kasan/report.c:495
kernfs_type include/linux/kernfs.h:335 [inline]
kernfs_leftmost_descendant fs/kernfs/dir.c:1261 [inline]
__kernfs_remove.part.0+0x843/0x960 fs/kernfs/dir.c:1369
__kernfs_remove fs/kernfs/dir.c:1356 [inline]
kernfs_remove_by_name_ns+0x108/0x190 fs/kernfs/dir.c:1589
sysfs_slab_add+0x133/0x1e0 mm/slub.c:5943
__kmem_cache_create+0x3e0/0x550 mm/slub.c:4899
create_cache mm/slab_common.c:229 [inline]
kmem_cache_create_usercopy+0x167/0x2a0 mm/slab_common.c:335
p9_client_create+0xd4d/0x1190 net/9p/client.c:993
v9fs_session_init+0x1e6/0x13c0 fs/9p/v9fs.c:408
v9fs_mount+0xb9/0xbd0 fs/9p/vfs_super.c:126
legacy_get_tree+0xf1/0x200 fs/fs_context.c:610
vfs_get_tree+0x85/0x2e0 fs/super.c:1530
do_new_mount fs/namespace.c:3040 [inline]
path_mount+0x675/0x1d00 fs/namespace.c:3370
do_mount fs/namespace.c:3383 [inline]
__do_sys_mount fs/namespace.c:3591 [inline]
__se_sys_mount fs/namespace.c:3568 [inline]
__x64_sys_mount+0x282/0x300 fs/namespace.c:3568
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f725f983aed
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f725f0f7028 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007f725faa3f80 RCX: 00007f725f983aed
RDX: 00000000200000c0 RSI: 0000000020000040 RDI: 0000000000000000
RBP: 00007f725f9f419c R08: 0000000020000280 R09: 0000000000000000
R10: 0000000000000408 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000006 R14: 00007f725faa3f80 R15: 00007f725f0d7000
</TASK>

Allocated by task 855:
kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:45 [inline]
set_alloc_info mm/kasan/common.c:437 [inline]
__kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:470
kasan_slab_alloc include/linux/kasan.h:224 [inline]
slab_post_alloc_hook mm/slab.h:727 [inline]
slab_alloc_node mm/slub.c:3243 [inline]
slab_alloc mm/slub.c:3251 [inline]
__kmem_cache_alloc_lru mm/slub.c:3258 [inline]
kmem_cache_alloc+0xbf/0x200 mm/slub.c:3268
kmem_cache_zalloc include/linux/slab.h:723 [inline]
__kernfs_new_node+0xd4/0x680 fs/kernfs/dir.c:593
kernfs_new_node fs/kernfs/dir.c:655 [inline]
kernfs_create_dir_ns+0x9c/0x220 fs/kernfs/dir.c:1010
sysfs_create_dir_ns+0x127/0x290 fs/sysfs/dir.c:59
create_dir lib/kobject.c:63 [inline]
kobject_add_internal+0x24a/0x8d0 lib/kobject.c:223
kobject_add_varg lib/kobject.c:358 [inline]
kobject_init_and_add+0x101/0x160 lib/kobject.c:441
sysfs_slab_add+0x156/0x1e0 mm/slub.c:5954
__kmem_cache_create+0x3e0/0x550 mm/slub.c:4899
create_cache mm/slab_common.c:229 [inline]
kmem_cache_create_usercopy+0x167/0x2a0 mm/slab_common.c:335
p9_client_create+0xd4d/0x1190 net/9p/client.c:993
v9fs_session_init+0x1e6/0x13c0 fs/9p/v9fs.c:408
v9fs_mount+0xb9/0xbd0 fs/9p/vfs_super.c:126
legacy_get_tree+0xf1/0x200 fs/fs_context.c:610
vfs_get_tree+0x85/0x2e0 fs/super.c:1530
do_new_mount fs/namespace.c:3040 [inline]
path_mount+0x675/0x1d00 fs/namespace.c:3370
do_mount fs/namespace.c:3383 [inline]
__do_sys_mount fs/namespace.c:3591 [inline]
__se_sys_mount fs/namespace.c:3568 [inline]
__x64_sys_mount+0x282/0x300 fs/namespace.c:3568
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Freed by task 857:
kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
kasan_set_track+0x21/0x30 mm/kasan/common.c:45
kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:370
____kasan_slab_free mm/kasan/common.c:367 [inline]
____kasan_slab_free mm/kasan/common.c:329 [inline]
__kasan_slab_free+0x108/0x190 mm/kasan/common.c:375
kasan_slab_free include/linux/kasan.h:200 [inline]
slab_free_hook mm/slub.c:1754 [inline]
slab_free_freelist_hook mm/slub.c:1780 [inline]
slab_free mm/slub.c:3534 [inline]
kmem_cache_free+0x9c/0x340 mm/slub.c:3551
kernfs_put.part.0+0x2b2/0x520 fs/kernfs/dir.c:547
kernfs_put+0x42/0x50 fs/kernfs/dir.c:521
__kernfs_remove.part.0+0x72d/0x960 fs/kernfs/dir.c:1407
__kernfs_remove fs/kernfs/dir.c:1356 [inline]
kernfs_remove_by_name_ns+0x108/0x190 fs/kernfs/dir.c:1589
sysfs_slab_add+0x133/0x1e0 mm/slub.c:5943
__kmem_cache_create+0x3e0/0x550 mm/slub.c:4899
create_cache mm/slab_common.c:229 [inline]
kmem_cache_create_usercopy+0x167/0x2a0 mm/slab_common.c:335
p9_client_create+0xd4d/0x1190 net/9p/client.c:993
v9fs_session_init+0x1e6/0x13c0 fs/9p/v9fs.c:408
v9fs_mount+0xb9/0xbd0 fs/9p/vfs_super.c:126
legacy_get_tree+0xf1/0x200 fs/fs_context.c:610
vfs_get_tree+0x85/0x2e0 fs/super.c:1530
do_new_mount fs/namespace.c:3040 [inline]
path_mount+0x675/0x1d00 fs/namespace.c:3370
do_mount fs/namespace.c:3383 [inline]
__do_sys_mount fs/namespace.c:3591 [inline]
__se_sys_mount fs/namespace.c:3568 [inline]
__x64_sys_mount+0x282/0x300 fs/namespace.c:3568
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

The buggy address belongs to the object at ffff888008880780
which belongs to the cache kernfs_node_cache of size 128
The buggy address is located 112 bytes inside of
128-byte region [ffff888008880780, ffff888008880800)

The buggy address belongs to the physical page:
page:00000000732833f8 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8880
flags: 0x100000000000200(slab|node=0|zone=1)
raw: 0100000000000200 0000000000000000 dead000000000122 ffff888001147280
raw: 0000000000000000 0000000000150015 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff888008880680: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
ffff888008880700: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff888008880780: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888008880800: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
ffff888008880880: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
==================================================================

Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable <stable@kernel.org> # -rc3
Signed-off-by: Christian A. Ehrhardt <lk@c--e.de>
Link: https://lore.kernel.org/r/20220913121723.691454-1-lk@c--e.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 783bd07d 27-Aug-2022 Tejun Heo <tj@kernel.org>

kernfs: Implement kernfs_show()

Currently, kernfs nodes can be created hidden and activated later by calling
kernfs_activate() to allow creation of multiple nodes to succeed or fail as
a unit. This is an one-way one-time-only transition. This patch introduces
kernfs_show() which can toggle visibility dynamically.

As the currently proposed use - toggling the cgroup pressure files - only
requires operating on leaf nodes, for the sake of simplicity, restrict it as
such for now.

Hiding uses the same mechanism as deactivation and likewise guarantees that
there are no in-flight operations on completion. KERNFS_ACTIVATED and
KERNFS_HIDDEN are used to manage the interactions between activations and
show/hide operations. A node is visible iff both activated & !hidden.

Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220828050440.734579-9-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# f8eb145e 27-Aug-2022 Tejun Heo <tj@kernel.org>

kernfs: Factor out kernfs_activate_one()

Factor out kernfs_activate_one() from kernfs_activate() and reorder
operations so that KERNFS_ACTIVATED now simply indicates whether activation
was attempted on the node ignoring whether activation took place. As the
flag doesn't have a reader, the refactoring and reordering shouldn't cause
any behavior difference.

Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220828050440.734579-8-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# c2549174 27-Aug-2022 Tejun Heo <tj@kernel.org>

kernfs: Add KERNFS_REMOVING flags

KERNFS_ACTIVATED tracks whether a given node has ever been activated. As a
node was only deactivated on removal, this was used for

1. Drain optimization (removed by the previous patch).
2. To hide !activated nodes
3. To avoid double activations
4. Reject adding children to a node being removed
5. Skip activaing a node which is being removed.

We want to decouple deactivation from removal so that nodes can be
deactivated and hidden dynamically, which makes KERNFS_ACTIVATED useless for
all of the above purposes.

#1 is already gone. #2 and #3 can instead test whether the node is currently
active. A new flag KERNFS_REMOVING is added to explicitly mark nodes which
are being removed for #4 and #5.

While this leaves KERNFS_ACTIVATED with no users, leave it be as it will be
used in a following patch.

Cc: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220828050440.734579-7-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 2d7f9f8c 27-Aug-2022 Tejun Heo <tj@kernel.org>

kernfs: Improve kernfs_drain() and always call on removal

__kernfs_remove() was skipping draining based on KERNFS_ACTIVATED - whether
the node has ever been activated since creation. Instead, update it to
always call kernfs_drain() which now drains or skips based on the precise
drain conditions. This ensures that the nodes will be deactivated and
drained regardless of their states.

This doesn't make meaningful difference now but will enable deactivating and
draining nodes dynamically by making removals safe when racing those
operations.

While at it, drop / update comments.

v2: Fix the inverted test on kernfs_should_drain_open_files() noted by
Chengming. This was fixed by the next unrelated patch in the previous
posting.

Cc: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220828050440.734579-6-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# bdb2fd7f 27-Aug-2022 Tejun Heo <tj@kernel.org>

kernfs: Skip kernfs_drain_open_files() more aggressively

Track the number of mmapped files and files that need to be released and
skip kernfs_drain_open_file() if both are zero, which are the precise
conditions which require draining open_files. The early exit test is
factored into kernfs_should_drain_open_files() which is now tested by
kernfs_drain_open_files()'s caller - kernfs_drain().

This isn't a meaningful optimization on its own but will enable future
stand-alone kernfs_deactivate() implementation.

v2: Chengming noticed that on->nr_to_release was leaking after ->open()
failure. Fix it by telling kernfs_unlink_open_file() that it's called
from the ->open() fail path and should dec the counter. Use kzalloc() to
allocate kernfs_open_node so that the tracking fields are correctly
initialized.

Cc: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220828050440.734579-5-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 72b5d5ae 30-Jun-2022 Yushan Zhou <katrinzhou@tencent.com>

kernfs: fix potential NULL dereference in __kernfs_remove

When lockdep is enabled, lockdep_assert_held_write would
cause potential NULL pointer dereference.

Fix the following smatch warnings:

fs/kernfs/dir.c:1353 __kernfs_remove() warn: variable dereferenced before check 'kn' (see line 1346)

Fixes: 393c3714081a ("kernfs: switch global kernfs_rwsem lock to per-fs lock")
Signed-off-by: Yushan Zhou <katrinzhou@tencent.com>
Link: https://lore.kernel.org/r/20220630082512.3482581-1-zys.zljxml@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 1a702dc8 16-May-2022 Hao Luo <haoluo@google.com>

kernfs: Separate kernfs_pr_cont_buf and rename_lock.

Previously the protection of kernfs_pr_cont_buf was piggy backed by
rename_lock, which means that pr_cont() needs to be protected under
rename_lock. This can cause potential circular lock dependencies.

If there is an OOM, we have the following call hierarchy:

-> cpuset_print_current_mems_allowed()
-> pr_cont_cgroup_name()
-> pr_cont_kernfs_name()

pr_cont_kernfs_name() will grab rename_lock and call printk. So we have
the following lock dependencies:

kernfs_rename_lock -> console_sem

Sometimes, printk does a wakeup before releasing console_sem, which has
the dependence chain:

console_sem -> p->pi_lock -> rq->lock

Now, imagine one wants to read cgroup_name under rq->lock, for example,
printing cgroup_name in a tracepoint in the scheduler code. They will
be holding rq->lock and take rename_lock:

rq->lock -> kernfs_rename_lock

Now they will deadlock.

A prevention to this circular lock dependency is to separate the
protection of pr_cont_buf from rename_lock. In principle, rename_lock
is to protect the integrity of cgroup name when copying to buf. Once
pr_cont_buf has got its content, rename_lock can be dropped. So it's
safe to drop rename_lock after kernfs_name_locked (and
kernfs_path_from_node_locked) and rely on a dedicated pr_cont_lock
to protect pr_cont_buf.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20220516190951.3144144-1-haoluo@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ad8d8693 27-Apr-2022 Minchan Kim <minchan@kernel.org>

kernfs: fix NULL dereferencing in kernfs_remove

kernfs_remove supported NULL kernfs_node param to bail out but revent
per-fs lock change introduced regression that dereferencing the
param without NULL check so kernel goes crash.

This patch checks the NULL kernfs_node in kernfs_remove and if so,
just return.

Quote from bug report by Jirka

```
The bug is triggered by running NAS Parallel benchmark suite on
SuperMicro servers with 2x Xeon(R) Gold 6126 CPU. Here is the error
log:

[ 247.035564] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 247.036009] #PF: supervisor read access in kernel mode
[ 247.036009] #PF: error_code(0x0000) - not-present page
[ 247.036009] PGD 0 P4D 0
[ 247.036009] Oops: 0000 [#1] PREEMPT SMP PTI
[ 247.058060] CPU: 1 PID: 6546 Comm: umount Not tainted
5.16.0393c3714081a53795bbff0e985d24146def6f57f+ #16
[ 247.058060] Hardware name: Supermicro Super Server/X11DDW-L, BIOS
2.0b 03/07/2018
[ 247.058060] RIP: 0010:kernfs_remove+0x8/0x50
[ 247.058060] Code: 4c 89 e0 5b 5d 41 5c 41 5d 41 5e c3 49 c7 c4 f4
ff ff ff eb b2 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00
41 54 55 <48> 8b 47 08 48 89 fd 48 85 c0 48 0f 44 c7 4c 8b 60 50 49 83
c4 60
[ 247.058060] RSP: 0018:ffffbbfa48a27e48 EFLAGS: 00010246
[ 247.058060] RAX: 0000000000000001 RBX: ffffffff89e31f98 RCX: 0000000080200018
[ 247.058060] RDX: 0000000080200019 RSI: fffff6760786c900 RDI: 0000000000000000
[ 247.058060] RBP: ffffffff89e31f98 R08: ffff926b61b24d00 R09: 0000000080200018
[ 247.122048] R10: ffff926b61b24d00 R11: ffff926a8040c000 R12: ffff927bd09a2000
[ 247.122048] R13: ffffffff89e31fa0 R14: dead000000000122 R15: dead000000000100
[ 247.122048] FS: 00007f01be0a8c40(0000) GS:ffff926fa8e40000(0000)
knlGS:0000000000000000
[ 247.122048] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 247.122048] CR2: 0000000000000008 CR3: 00000001145c6003 CR4: 00000000007706e0
[ 247.122048] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 247.122048] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 247.122048] PKRU: 55555554
[ 247.122048] Call Trace:
[ 247.122048] <TASK>
[ 247.122048] rdt_kill_sb+0x29d/0x350
[ 247.122048] deactivate_locked_super+0x36/0xa0
[ 247.122048] cleanup_mnt+0x131/0x190
[ 247.122048] task_work_run+0x5c/0x90
[ 247.122048] exit_to_user_mode_prepare+0x229/0x230
[ 247.122048] syscall_exit_to_user_mode+0x18/0x40
[ 247.122048] do_syscall_64+0x48/0x90
[ 247.122048] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 247.122048] RIP: 0033:0x7f01be2d735b
```

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215696
Link: https://lore.kernel.org/lkml/CAE4VaGDZr_4wzRn2___eDYRtmdPaGGJdzu_LCSkJYuY9BEO3cw@mail.gmail.com/
Fixes: 393c3714081a (kernfs: switch global kernfs_rwsem lock to per-fs lock)
Cc: stable@vger.kernel.org
Reported-by: Jirka Hladky <jhladky@redhat.com>
Tested-by: Jirka Hladky <jhladky@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Link: https://lore.kernel.org/r/20220427172152.3505364-1-minchan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# f2eb478f 22-Feb-2022 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

kernfs: move struct kernfs_root out of the public view.

There is no need to have struct kernfs_root be part of kernfs.h for
the whole kernel to see and poke around it. Move it internal to kernfs
code and provide a helper function, kernfs_root_to_node(), to handle the
one field that kernfs users were directly accessing from the structure.

Cc: Imran Khan <imran.f.khan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220222070713.3517679-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 555a0ce4 01-Dec-2021 Minchan Kim <minchan@kernel.org>

kernfs: prevent early freeing of root node

Marek reported the warning below.

=========================
WARNING: held lock freed!
5.16.0-rc2+ #10984 Not tainted
-------------------------
kworker/1:0/18 is freeing memory ffff00004034e200-ffff00004034e3ff,
with a lock still held there!
ffff00004034e348 (&root->kernfs_rwsem){++++}-{3:3}, at:
__kernfs_remove+0x310/0x37c
3 locks held by kworker/1:0/18:
#0: ffff000040107938 ((wq_completion)cgroup_destroy){+.+.}-{0:0}, at:
process_one_work+0x1f0/0x6f0
#1: ffff80000b55bdc0
((work_completion)(&(&css->destroy_rwork)->work)){+.+.}-{0:0}, at:
process_one_work+0x1f0/0x6f0
#2: ffff00004034e348 (&root->kernfs_rwsem){++++}-{3:3}, at:
__kernfs_remove+0x310/0x37c

stack backtrace:
CPU: 1 PID: 18 Comm: kworker/1:0 Not tainted 5.16.0-rc2+ #10984
Hardware name: Raspberry Pi 4 Model B (DT)
Workqueue: cgroup_destroy css_free_rwork_fn
Call trace:
dump_backtrace+0x0/0x1ac
show_stack+0x18/0x24
dump_stack_lvl+0x8c/0xb8
dump_stack+0x18/0x34
debug_check_no_locks_freed+0x124/0x140
kfree+0xf0/0x3a4
kernfs_put+0x1f8/0x224
__kernfs_remove+0x1b8/0x37c
kernfs_destroy_root+0x38/0x50
css_free_rwork_fn+0x288/0x3d4
process_one_work+0x288/0x6f0
worker_thread+0x74/0x470
kthread+0x188/0x194
ret_from_fork+0x10/0x20

Since kernfs moves the kernfs_rwsem lock into root, it couldn't hold
the lock when the root node is tearing down. Thus, get the refcount
of root node.

Fixes: 393c3714081a ("kernfs: switch global kernfs_rwsem lock to per-fs lock")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Link: https://lore.kernel.org/r/20211201231648.1027165-1-minchan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 393c3714 18-Nov-2021 Minchan Kim <minchan@kernel.org>

kernfs: switch global kernfs_rwsem lock to per-fs lock

The kernfs implementation has big lock granularity(kernfs_rwsem) so
every kernfs-based(e.g., sysfs, cgroup) fs are able to compete the
lock. It makes trouble for some cases to wait the global lock
for a long time even though they are totally independent contexts
each other.

A general example is process A goes under direct reclaim with holding
the lock when it accessed the file in sysfs and process B is waiting
the lock with exclusive mode and then process C is waiting the lock
until process B could finish the job after it gets the lock from
process A.

This patch switches the global kernfs_rwsem to per-fs lock, which
put the rwsem into kernfs_root.

Suggested-by: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Link: https://lore.kernel.org/r/20211118230008.2679780-1-minchan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 410d591a 03-Oct-2021 Ian Kent <raven@themaw.net>

kernfs: don't create a negative dentry if inactive node exists

It's been reported that doing stress test for module insertion and
removal can result in an ENOENT from libkmod for a valid module.

In kernfs_iop_lookup() a negative dentry is created if there's no kernfs
node associated with the dentry or the node is inactive.

But inactive kernfs nodes are meant to be invisible to the VFS and
creating a negative dentry for these can have unexpected side effects
when the node transitions to an active state.

The point of creating negative dentries is to avoid the expensive
alloc/free cycle that occurs if there are frequent lookups for kernfs
attributes that don't exist. So kernfs nodes that are not yet active
should not result in a negative dentry being created so when they
transition to an active state VFS lookups can create an associated
dentry is a natural way.

It's also been reported that https://github.com/osandov/blktests.git
test block/001 hangs during the test. It was suggested that recent
changes to blktests might have caused it but applying this patch
resolved the problem without change to blktests.

Fixes: c7e7c04274b1 ("kernfs: use VFS negative dentry caching")
Tested-by: Yi Zhang <yi.zhang@redhat.com>
ACKed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/163330943316.19450.15056895533949392922.stgit@mickey.themaw.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# df38d852 28-Sep-2021 Hou Tao <houtao1@huawei.com>

kernfs: also call kernfs_set_rev() for positive dentry

A KMSAN warning is reported by Alexander Potapenko:

BUG: KMSAN: uninit-value in kernfs_dop_revalidate+0x61f/0x840
fs/kernfs/dir.c:1053
kernfs_dop_revalidate+0x61f/0x840 fs/kernfs/dir.c:1053
d_revalidate fs/namei.c:854
lookup_dcache fs/namei.c:1522
__lookup_hash+0x3a6/0x590 fs/namei.c:1543
filename_create+0x312/0x7c0 fs/namei.c:3657
do_mkdirat+0x103/0x930 fs/namei.c:3900
__do_sys_mkdir fs/namei.c:3931
__se_sys_mkdir fs/namei.c:3929
__x64_sys_mkdir+0xda/0x120 fs/namei.c:3929
do_syscall_x64 arch/x86/entry/common.c:51

It seems a positive dentry in kernfs becomes a negative dentry directly
through d_delete() in vfs_rmdir(). dentry->d_time is uninitialized
when accessing it in kernfs_dop_revalidate(), because it is only
initialized when created as negative dentry in kernfs_iop_lookup().

The problem can be reproduced by the following command:

cd /sys/fs/cgroup/pids && mkdir hi && stat hi && rmdir hi && stat hi

A simple fixes seems to be initializing d->d_time for positive dentry
in kernfs_iop_lookup() as well. The downside is the negative dentry
will be revalidated again after it becomes negative in d_delete(),
because the revison of its parent must have been increased due to
its removal.

Alternative solution is implement .d_iput for kernfs, and assign d_time
for the newly-generated negative dentry in it. But we may need to
take kernfs_rwsem to protect again the concurrent kernfs_link_sibling()
on the parent directory, it is a little over-killing. Now the simple
fix is chosen.

Link: https://marc.info/?l=linux-fsdevel&m=163249838610499
Fixes: c7e7c04274b1 ("kernfs: use VFS negative dentry caching")
Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20210928140750.1274441-1-houtao1@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# df6192f4 16-Jul-2021 Ian Kent <raven@themaw.net>

kernfs: dont call d_splice_alias() under kernfs node lock

The call to d_splice_alias() in kernfs_iop_lookup() doesn't depend on
any kernfs node so there's no reason to hold the kernfs node lock when
calling it.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/162642772000.63632.10672683419693513226.stgit@web.messagingengine.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 7ba0273b 16-Jul-2021 Ian Kent <raven@themaw.net>

kernfs: switch kernfs to use an rwsem

The kernfs global lock restricts the ability to perform kernfs node
lookup operations in parallel during path walks.

Change the kernfs mutex to an rwsem so that, when opportunity arises,
node searches can be done in parallel with path walk lookups.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/162642770946.63632.2218304587223241374.stgit@web.messagingengine.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# c7e7c042 16-Jul-2021 Ian Kent <raven@themaw.net>

kernfs: use VFS negative dentry caching

If there are many lookups for non-existent paths these negative lookups
can lead to a lot of overhead during path walks.

The VFS allows dentries to be created as negative and hashed, and caches
them so they can be used to reduce the fairly high overhead alloc/free
cycle that occurs during these lookups.

Use the kernfs node parent revision to identify if a change has been
made to the containing directory so that the negative dentry can be
discarded and the lookup redone.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/162642770420.63632.15791924970508867106.stgit@web.messagingengine.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 895adbec 16-Jul-2021 Ian Kent <raven@themaw.net>

kernfs: add a revision to identify directory node changes

Add a revision counter to kernfs directory nodes so it can be used
to detect if a directory node has changed during negative dentry
revalidation.

There's an assumption that sizeof(unsigned long) <= sizeof(pointer)
on all architectures and as far as I know that assumption holds.

So adding a revision counter to the struct kernfs_elem_dir variant of
the kernfs_node type union won't increase the size of the kernfs_node
struct. This is because struct kernfs_elem_dir is at least
sizeof(pointer) smaller than the largest union variant. It's tempting
to make the revision counter a u64 but that would increase the size of
kernfs_node on archs where sizeof(pointer) is smaller than the revision
counter.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/162642769895.63632.8356662784964509867.stgit@web.messagingengine.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# d826e036 15-Jun-2021 Ian Kent <raven@themaw.net>

kernfs: move revalidate to be near lookup

While the dentry operation kernfs_dop_revalidate() is grouped with
dentry type functions it also has a strong affinity to the inode
operation ->lookup().

It makes sense to locate this function near to kernfs_iop_lookup()
because we will be adding VFS negative dentry caching to reduce path
lookup overhead for non-existent paths.

There's no functional change from this patch.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/162375275365.232295.8995526416263659926.stgit@web.messagingengine.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 549c7297 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: make helpers idmap mount aware

Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 0288e7fa 13-Nov-2020 Hui Su <sh_def@163.com>

fs/kernfs: remove the double check of dentry->inode

In both kernfs_node_from_dentry() and in
kernfs_dentry_node(), we will check the dentry->inode
is NULL or not, which is superfluous.

So remove the check in kernfs_node_from_dentry().

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hui Su <sh_def@163.com>
Link: https://lore.kernel.org/r/20201113132143.GA119541@rlk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 21774fd8 15-Oct-2020 Willem de Bruijn <willemb@google.com>

kernfs: bring names in comments in line with code

Fix two stragglers in the comments of the below rename operation.

Fixes: adc5e8b58f48 ("kernfs: drop s_ prefix from kernfs_node members")
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20201015185726.1386868-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 5bf33f04 30-Dec-2019 Mateusz Nosek <mateusznosek0@gmail.com>

fs/kernfs/dir.c: Clean code by removing always true condition

Previously there was an additional check if variable pos is not null.
However, this check happens after entering while loop and only then,
which can happen only if pos is not null.
Therefore the additional check is redundant and can be removed.

Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20191230191628.21099-1-mateusznosek0@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 40430452 04-Nov-2019 Tejun Heo <tj@kernel.org>

kernfs: use 64bit inos if ino_t is 64bit

Each kernfs_node is identified with a 64bit ID. The low 32bit is
exposed as ino and the high gen. While this already allows using inos
as keys by looking up with wildcard generation number of 0, it's
adding unnecessary complications for 64bit ino archs which can
directly use kernfs_node IDs as inos to uniquely identify each cgroup
instance.

This patch exposes IDs directly as inos on 64bit ino archs. The
conversion is mostly straight-forward.

* 32bit ino archs behave the same as before. 64bit ino archs now use
the whole 64bit ID as ino and the generation number is fixed at 1.

* 64bit inos still use the same idr allocator which gurantees that the
lower 32bits identify the current live instance uniquely and the
high 32bits are incremented whenever the low bits wrap. As the
upper 32bits are no longer used as gen and we don't wanna start ino
allocation with 33rd bit set, the initial value for highbits
allocation is changed to 0 on 64bit ino archs.

* blktrace exposes two 32bit numbers - (INO,GEN) pair - to identify
the issuing cgroup. Userland builds FILEID_INO32_GEN fids from
these numbers to look up the cgroups. To remain compatible with the
behavior, always output (LOW32,HIGH32) which will be constructed
back to the original 64bit ID by __kernfs_fh_to_dentry().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>


# fe0f726c 04-Nov-2019 Tejun Heo <tj@kernel.org>

kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()

kernfs_find_and_get_node_by_ino() looks the kernfs_node matching the
specified ino. On top of that, kernfs_get_node_by_id() and
kernfs_fh_get_inode() implement full ID matching by testing the rest
of ID.

On surface, confusingly, the two are slightly different in that the
latter uses 0 gen as wildcard while the former doesn't - does it mean
that the latter can't uniquely identify inodes w/ 0 gen? In practice,
this is a distinction without a difference because generation number
starts at 1. There are no actual IDs with 0 gen, so it can always
safely used as wildcard.

Let's simplify the code by renaming kernfs_find_and_get_node_by_ino()
to kernfs_find_and_get_node_by_id(), moving all lookup logics into it,
and removing now unnecessary kernfs_get_node_by_id().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 67c0496e 04-Nov-2019 Tejun Heo <tj@kernel.org>

kernfs: convert kernfs_node->id from union kernfs_node_id to u64

kernfs_node->id is currently a union kernfs_node_id which represents
either a 32bit (ino, gen) pair or u64 value. I can't see much value
in the usage of the union - all that's needed is a 64bit ID which the
current code is already limited to. Using a union makes the code
unnecessarily complicated and prevents using 64bit ino without adding
practical benefits.

This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
ino is stored in the lower 32bits and gen upper. Accessors -
kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the
ino and gen. This simplifies ID handling less cumbersome and will
allow using 64bit inos on supported archs.

This patch doesn't make any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexei Starovoitov <ast@kernel.org>


# 880df131 04-Nov-2019 Tejun Heo <tj@kernel.org>

kernfs: kernfs_find_and_get_node_by_ino() should only look up activated nodes

kernfs node can be created in two separate steps - allocation and
activation. This is used to make kernfs nodes visible only after the
internal states attached to the node are fully initialized.
kernfs_find_and_get_node_by_id() currently allows lookups of nodes
which aren't activated yet and thus can expose nodes are which are
still being prepped by kernfs users.

Fix it by disallowing lookups of nodes which aren't activated yet.

kernfs_find_and_get_node_by_ino()

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>


# b680b081 04-Nov-2019 Tejun Heo <tj@kernel.org>

kernfs: use dumber locking for kernfs_find_and_get_node_by_ino()

kernfs_find_and_get_node_by_ino() uses RCU protection. It's currently
a bit buggy because it can look up a node which hasn't been activated
yet and thus may end up exposing a node that the kernfs user is still
prepping.

While it can be fixed by pushing it further in the current direction,
it's already complicated and isn't clear whether the complexity is
justified. The main use of kernfs_find_and_get_node_by_ino() is for
exportfs operations. They aren't super hot and all the follow-up
operations (e.g. mapping to path) use normal locking anyway.

Let's switch to a dumber locking scheme and protect the lookup with
kernfs_idr_lock.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>


# e23f568a 04-Nov-2019 Tejun Heo <tj@kernel.org>

kernfs: fix ino wrap-around detection

When the 32bit ino wraps around, kernfs increments the generation
number to distinguish reused ino instances. The wrap-around detection
tests whether the allocated ino is lower than what the cursor but the
cursor is pointing to the next ino to allocate so the condition never
triggers.

Fix it by remembering the last ino and comparing against that.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 4a3ef68acacf ("kernfs: implement i_generation")
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: stable@vger.kernel.org # v4.14+


# 5facae4f 18-Sep-2019 Qian Cai <cai@lca.pw>

locking/lockdep: Remove unused @nested argument from lock_release()

Since the following commit:

b4adfe8e05f1 ("locking/lockdep: Remove unused argument in __lock_release")

@nested is no longer used in lock_release(), so remove it from all
lock_release() calls and friends.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: airlied@linux.ie
Cc: akpm@linux-foundation.org
Cc: alexander.levin@microsoft.com
Cc: daniel@iogearbox.net
Cc: davem@davemloft.net
Cc: dri-devel@lists.freedesktop.org
Cc: duyuyang@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: hannes@cmpxchg.org
Cc: intel-gfx@lists.freedesktop.org
Cc: jack@suse.com
Cc: jlbec@evilplan.or
Cc: joonas.lahtinen@linux.intel.com
Cc: joseph.qi@linux.alibaba.com
Cc: jslaby@suse.com
Cc: juri.lelli@redhat.com
Cc: maarten.lankhorst@linux.intel.com
Cc: mark@fasheh.com
Cc: mhocko@kernel.org
Cc: mripard@kernel.org
Cc: ocfs2-devel@oss.oracle.com
Cc: rodrigo.vivi@intel.com
Cc: sean@poorly.run
Cc: st@kernel.org
Cc: tj@kernel.org
Cc: tytso@mit.edu
Cc: vdavydov.dev@gmail.com
Cc: vincent.guittot@linaro.org
Cc: viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 8097c43b 08-Aug-2019 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: fix memleak in kernel_ops_readdir()"

This reverts commit cc798c83898ea0a77fcaa1a92afda35c3c3ded74.

Tony writes:
Somehow this causes a regression in Linux next for me where I'm
seeing lots of sysfs entries now missing under
/sys/bus/platform/devices.

For example, I now only see one .serial entry show up in sysfs.
Things work again if I revert commit cc798c83898e ("kernfs: fix
memleak inkernel_ops_readdir()"). Any ideas why that would be?

Tejun says:
Ugh, you're right. It can get double-put cuz ctx->pos is put by
release too.

So reverting it for now.

Reported-by: Tony Lindgren <tony@atomide.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Fixes: cc798c83898e ("kernfs: fix memleak in kernel_ops_readdir()")
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# cc798c83 05-Aug-2019 Andrea Arcangeli <aarcange@redhat.com>

kernfs: fix memleak in kernel_ops_readdir()

If getdents64 is killed or hits on segfault, it'll leave cgroups
directories in sysfs pinned leaking memory because the kernfs node
won't be freed on rmdir and the parent neither.

Repro:

# for i in `seq 1000`; do mkdir $i; done
# rmdir *
# for i in `seq 1000`; do mkdir $i; done
# rmdir *

# for i in `seq 1000`; do while :; do ls $i/ >/dev/null; done & done
# while :; do killall ls; done

kernfs_node_cache in /proc/slabinfo keeps going up as expected.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # goes way back to original sysfs days
Link: https://lore.kernel.org/r/20190805173404.GF136335@devbig004.ftw2.facebook.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# bbe70e4e 23-Jul-2019 Jia-Ju Bai <baijiaju1990@gmail.com>

fs: kernfs: Fix possible null-pointer dereferences in kernfs_path_from_node_locked()

In kernfs_path_from_node_locked(), there is an if statement on line 147
to check whether buf is NULL:
if (buf)

When buf is NULL, it is used on line 151:
len += strlcpy(buf + len, parent_str, ...)
and line 158:
len += strlcpy(buf + len, "/", ...)
and line 160:
len += strlcpy(buf + len, kn->name, ...)

Thus, possible null-pointer dereferences may occur.

To fix these possible bugs, buf is checked before being used.
If it is NULL, -EINVAL is returned.

These bugs are found by a static analysis tool STCheck written by us.

Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Link: https://lore.kernel.org/r/20190724022242.27505-1-baijiaju1990@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 2fd60da4 08-Jul-2019 Peng Wang <rocking@whu.edu.cn>

kernfs: fix potential null pointer dereference

Get root safely after kn is ensureed to be not null.

Signed-off-by: Peng Wang <rocking@whu.edu.cn>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20190708151611.13242-1-rocking@whu.edu.cn
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 55716d26 01-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428

Based on 1 normalized pattern(s):

this file is released under the gplv2

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 68 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 99826790 16-Apr-2019 Andrea Parri <andrea.parri@amarulasolutions.com>

kernfs: fix barrier usage in __kernfs_new_node()

smp_mb__before_atomic() can not be applied to atomic_set(). Remove the
barrier and rely on RELEASE synchronization.

Fixes: ba16b2846a8c6 ("kernfs: add an API to get kernfs node from inode number")
Cc: stable@vger.kernel.org
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# e19dfdc8 22-Feb-2019 Ondrej Mosnacek <omosnace@redhat.com>

kernfs: initialize security of newly created nodes

Use the new security_kernfs_init_security() hook to allow LSMs to
possibly assign a non-default security context to a newly created kernfs
node based on the attributes of the new node and also its parent node.

This fixes an issue with cgroupfs under SELinux, where newly created
cgroup subdirectories/files would not inherit its parent's context if
it had been set explicitly to a non-default value (other than the genfs
context specified by the policy). This can be reproduced as follows (on
Fedora/RHEL):

# mkdir /sys/fs/cgroup/unified/test
# # Need permissive to change the label under Fedora policy:
# setenforce 0
# chcon -t container_file_t /sys/fs/cgroup/unified/test
# ls -lZ /sys/fs/cgroup/unified
total 0
-r--r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.controllers
-rw-r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.max.depth
-rw-r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.max.descendants
-rw-r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.procs
-r--r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.stat
-rw-r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.subtree_control
-rw-r--r--. 1 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 cgroup.threads
drwxr-xr-x. 2 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 init.scope
drwxr-xr-x. 26 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:21 system.slice
drwxr-xr-x. 3 root root system_u:object_r:container_file_t:s0 0 Jan 29 03:15 test
drwxr-xr-x. 3 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:06 user.slice
# mkdir /sys/fs/cgroup/unified/test/subdir

Actual result:

# ls -ldZ /sys/fs/cgroup/unified/test/subdir
drwxr-xr-x. 2 root root system_u:object_r:cgroup_t:s0 0 Jan 29 03:15 /sys/fs/cgroup/unified/test/subdir

Expected result:

# ls -ldZ /sys/fs/cgroup/unified/test/subdir
drwxr-xr-x. 2 root root unconfined_u:object_r:container_file_t:s0 0 Jan 29 03:15 /sys/fs/cgroup/unified/test/subdir

Link: https://github.com/SELinuxProject/selinux-kernel/issues/39

Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>


# 0ac6075a 22-Feb-2019 Ondrej Mosnacek <omosnace@redhat.com>

kernfs: use simple_xattrs for security attributes

Replace the special handling of security xattrs with simple_xattrs, as
is already done for the trusted xattrs. This simplifies the code and
allows LSMs to use more than just a single xattr to do their business.

Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: manual merge fixes]
Signed-off-by: Paul Moore <paul@paul-moore.com>


# 05895219 22-Feb-2019 Ondrej Mosnacek <omosnace@redhat.com>

kernfs: clean up struct kernfs_iattrs

Right now, kernfs_iattrs embeds the whole struct iattr, even though it
doesn't really use half of its fields... This both leads to wasting
space and makes the code look awkward. Let's just list the few fields
we need directly in struct kernfs_iattrs.

Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: merged a number of chunks manually due to fuzz]
Signed-off-by: Paul Moore <paul@paul-moore.com>


# 26e28d68 05-Feb-2019 Ayush Mittal <ayush.m@samsung.com>

kernfs: Allocating memory for kernfs_iattrs with kmem_cache.

Creating a new cache for kernfs_iattrs.
Currently, memory is allocated with kzalloc() which
always gives aligned memory. On ARM, this is 64 byte aligned.
To avoid the wastage of memory in aligning the size requested,
a new cache for kernfs_iattrs is created.

Size of struct kernfs_iattrs is 80 Bytes.
On ARM, it will come in kmalloc-128 slab.
and it will come in kmalloc-192 slab if debug info is enabled.
Extra bytes taken 48 bytes.

Total number of objects created : 4096
Total saving = 48*4096 = 192 KB

After creating new slab(When debug info is enabled) :
sh-3.2# cat /proc/slabinfo
...
kernfs_iattrs_cache 4069 4096 128 32 1 : tunables 0 0 0 : slabdata 128 128 0
...

All testing has been done on ARM target.

Signed-off-by: Ayush Mittal <ayush.m@samsung.com>
Signed-off-by: Vaneet Narang <v.narang@samsung.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 488dee96 20-Jul-2018 Dmitry Torokhov <dmitry.torokhov@gmail.com>

kernfs: allow creating kernfs objects with arbitrary uid/gid

This change allows creating kernfs files and directories with arbitrary
uid/gid instead of always using GLOBAL_ROOT_UID/GID by extending
kernfs_create_dir_ns() and kernfs_create_file_ns() with uid/gid arguments.
The "simple" kernfs_create_file() and kernfs_create_dir() are left alone
and always create objects belonging to the global root.

When creating symlinks ownership (uid/gid) is taken from the target kernfs
object.

Co-Developed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 95582b00 08-May-2018 Deepa Dinamani <deepa.kernel@gmail.com>

vfs: change inode times to use struct timespec64

struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.

The change was made with the help of the following cocinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.

The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.

virtual patch

@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
current_time ( ... )
{
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
...
- return timespec_trunc(
+ return timespec64_trunc(
... );
}

@ depends on patch @
identifier xtime;
@@
struct \( iattr \| inode \| kstat \) {
...
- struct timespec xtime;
+ struct timespec64 xtime;
...
}

@ depends on patch @
identifier t;
@@
struct inode_operations {
...
int (*update_time) (...,
- struct timespec t,
+ struct timespec64 t,
...);
...
}

@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
...) { ... }

@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
) { ... }

@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)

<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)

@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}

@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
|
node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
stat->xtime = node2->i_xtime1;
|
stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>


# c53cd490 12-Jul-2017 Shaohua Li <shli@fb.com>

kernfs: introduce kernfs_node_id

inode number and generation can identify a kernfs node. We are going to
export the identification by exportfs operations, so put ino and
generation into a separate structure. It's convenient when later patches
use the identification.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 319ba91d 12-Jul-2017 Shaohua Li <shli@fb.com>

kernfs: don't set dentry->d_fsdata

When working on adding exportfs operations in kernfs, I found it's hard
to initialize dentry->d_fsdata in the exportfs operations. Looks there
is no way to do it without race condition. Look at the kernfs code
closely, there is no point to set dentry->d_fsdata. inode->i_private
already points to kernfs_node, and we can get inode from a dentry. So
this patch just delete the d_fsdata usage.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ba16b284 12-Jul-2017 Shaohua Li <shli@fb.com>

kernfs: add an API to get kernfs node from inode number

Add an API to get kernfs node from inode number. We will need this to
implement exportfs operations.

This API will be used in blktrace too later, so it should be as fast as
possible. To make the API lock free, kernfs node is freed in RCU
context. And we depend on kernfs_node count/ino number to filter out
stale kernfs nodes.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4a3ef68a 12-Jul-2017 Shaohua Li <shli@fb.com>

kernfs: implement i_generation

Set i_generation for kernfs inode. This is required to implement
exportfs operations. The generation is 32-bit, so it's possible the
generation wraps up and we find stale files. To reduce the posssibility,
we don't reuse inode numer immediately. When the inode number allocation
wraps, we increase generation number. In this way generation/inode
number consist of a 64-bit number which is unlikely duplicated. This
does make the idr tree more sparse and waste some memory. Since idr
manages 32-bit keys, idr uses a 6-level radix tree, each level covers 6
bits of the key. In a 100k inode kernfs, the worst case will have around
300k radix tree node. Each node is 576bytes, so the tree will use about
~150M memory. Sounds not too bad, if this really is a problem, we should
find better data structure.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7d35079f 12-Jul-2017 Shaohua Li <shli@fb.com>

kernfs: use idr instead of ida to manage inode number

kernfs uses ida to manage inode number. The problem is we can't get
kernfs_node from inode number with ida. Switching to use idr, next patch
will add an API to get kernfs_node from inode number.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 17627157 08-Feb-2017 Konstantin Khlebnikov <koct9i@gmail.com>

kernfs: handle null pointers while printing node name and path

Null kernfs nodes could be found at cgroups during construction.
It seems safer to handle these null pointers right in kernfs in
the same way as printf prints "(null)" for null pointer string.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 0e67db2f 27-Dec-2016 Tejun Heo <tj@kernel.org>

kernfs: add kernfs_ops->open/release() callbacks

Add ->open/release() methods to kernfs_ops. ->open() is called when
the file is opened and ->release() when the file is either released or
severed. These callbacks can be used, for example, to manage
persistent caching objects over multiple seq_file iterations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>


# fd50ecad 29-Sep-2016 Andreas Gruenbacher <agruenba@redhat.com>

vfs: Remove {get,set,remove}xattr inode operations

These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e72a1a8b 29-Sep-2016 Andreas Gruenbacher <agruenba@redhat.com>

kernfs: Switch to generic xattr handlers

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2773bf00 27-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

fs: rename "rename2" i_op to "rename"

Generated patch:

sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 1cd66c93 27-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

fs: make remaining filesystems use .rename2

This is trivial to do:

- add flags argument to foo_rename()
- check if flags is zero
- assign foo_rename() to .rename2 instead of .rename

This doesn't mean it's impossible to support RENAME_NOREPLACE for these
filesystems, but it is not trivial, like for local filesystems.
RENAME_NOREPLACE must guarantee atomicity (i.e. it shouldn't be possible
for a file to be created on one host while it is overwritten by rename on
another host).

Filesystems converted:

9p, afs, ceph, coda, ecryptfs, kernfs, lustre, ncpfs, nfs, ocfs2, orangefs.

After this, we can get rid of the duplicate interfaces for rename.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: David Howells <dhowells@redhat.com> [AFS]
Acked-by: Mike Marshall <hubcap@omnibond.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Mark Fasheh <mfasheh@suse.com>


# bb09c863 10-Aug-2016 Tejun Heo <tj@kernel.org>

kernfs: remove kernfs_path_len()

It doesn't have any in-kernel user and the same result can be obtained
from kernfs_path(@kn, NULL, 0). Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>


# 3abb1d90 10-Aug-2016 Tejun Heo <tj@kernel.org>

kernfs: make kernfs_path*() behave in the style of strlcpy()

kernfs_path*() functions always return the length of the full path but
the path content is undefined if the length is larger than the
provided buffer. This makes its behavior different from strlcpy() and
requires error handling in all its users even when they don't care
about truncation. In addition, the implementation can actully be
simplified by making it behave properly in strlcpy() style.

* Update kernfs_path_from_node_locked() to always fill up the buffer
with path. If the buffer is not large enough, the output is
truncated and terminated.

* kernfs_path() no longer needs error handling. Make it a simple
inline wrapper around kernfs_path_from_node().

* sysfs_warn_dup()'s use of kernfs_path() doesn't need error handling.
Updated accordingly.

* cgroup_path()'s use of kernfs_path() updated to retain the old
behavior.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>


# 8387ff25 10-Jun-2016 Linus Torvalds <torvalds@linux-foundation.org>

vfs: make the string hashes salt the hash

We always mixed in the parent pointer into the dentry name hash, but we
did it late at lookup time. It turns out that we can simplify that
lookup-time action by salting the hash with the parent pointer early
instead of late.

A few other users of our string hashes also wanted to mix in their own
pointers into the hash, and those are updated to use the same mechanism.

Hash users that don't have any particular initial salt can just use the
NULL pointer as a no-salt.

Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8cb0d2c1 20-Apr-2016 Al Viro <viro@zeniv.linux.org.uk>

kernfs: no point locking directory around that generic_file_llseek()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e99ed4de 17-Apr-2016 Serge Hallyn <serge.hallyn@ubuntu.com>

kernfs_path_from_node_locked: don't overwrite nlen

We've calculated @len to be the bytes we need for '/..' entries from
@kn_from to the common ancestor, and calculated @nlen to be the extra
bytes we need to get from the common ancestor to @kn_to. We use them
as such at the end. But in the loop copying the actual entries, we
overwrite @nlen. Use a temporary variable for that instead.

Without this, the return length, when the buffer is large enough, is
wrong. (When the buffer is NULL or too small, the returned value is
correct. The buffer contents are also correct.)

Interestingly, no callers of this function are affected by this as of
yet. However the upcoming cgroup_show_path() will be.

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>


# 3a3a5fec 22-Feb-2016 Deepa Dinamani <deepa.kernel@gmail.com>

fs: kernfs: Replace CURRENT_TIME by current_fs_time()

This is in preparation for the series that transitions
filesystem timestamps to use 64 bit time and hence make
them y2038 safe.

CURRENT_TIME macro will be deleted before merging the
aforementioned series.

Use current_fs_time() instead of CURRENT_TIME for inode
timestamps.

struct kernfs_node is associated with a sysfs file/ directory.
Truncate the values to appropriate time granularity when
writing to inode timestamps of the files.

ktime_get_real_ts() is used to obtain times for
struct kernfs_iattrs. Since these times are later assigned to
inode times using timespec_truncate() for all filesystem based
operations, we can save the supers list traversal time here by
using ktime_get_real_ts() directly.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 9f6df573 29-Jan-2016 Aditya Kali <adityakali@google.com>

kernfs: Add API to generate relative kernfs path

The new function kernfs_path_from_node() generates and returns kernfs
path of a given kernfs_node relative to a given parent kernfs_node.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>


# e56ed358 14-Jan-2016 Tejun Heo <tj@kernel.org>

kernfs: make kernfs_walk_ns() use kernfs_pr_cont_buf[]

kernfs_walk_ns() uses a static path_buf[PATH_MAX] to separate out path
components. Keeping around the 4k buffer just for kernfs_walk_ns() is
wasteful. This patch makes it piggyback on kernfs_pr_cont_buf[]
instead. This requires kernfs_walk_ns() to hold kernfs_rename_lock.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 5955102c 22-Jan-2016 Al Viro <viro@zeniv.linux.org.uk>

wrappers for ->i_mutex access

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b2a209ff 14-Jan-2016 Vladimir Davydov <vdavydov.dev@gmail.com>

Revert "kernfs: do not account ino_ida allocations to memcg"

Currently, all kmem allocations (namely every kmem_cache_alloc, kmalloc,
alloc_kmem_pages call) are accounted to memory cgroup automatically.
Callers have to explicitly opt out if they don't want/need accounting
for some reason. Such a design decision leads to several problems:

- kmalloc users are highly sensitive to failures, many of them
implicitly rely on the fact that kmalloc never fails, while memcg
makes failures quite plausible.

- A lot of objects are shared among different containers by design.
Accounting such objects to one of containers is just unfair.
Moreover, it might lead to pinning a dead memcg along with its kmem
caches, which aren't tiny, which might result in noticeable increase
in memory consumption for no apparent reason in the long run.

- There are tons of short-lived objects. Accounting them to memcg will
only result in slight noise and won't change the overall picture, but
we still have to pay accounting overhead.

For more info, see

- http://lkml.kernel.org/r/20151105144002.GB15111%40dhcp22.suse.cz
- http://lkml.kernel.org/r/20151106090555.GK29259@esperanza

Therefore this patchset switches to the white list policy. Now kmalloc
users have to explicitly opt in by passing __GFP_ACCOUNT flag.

Currently, the list of accounted objects is quite limited and only
includes those allocations that (1) are known to be easily triggered
from userspace and (2) can fail gracefully (for the full list see patch
no. 6) and it still misses many object types. However, accounting only
those objects should be a satisfactory approximation of the behavior we
used to have for most sane workloads.

This patch (of 6):

Revert 499611ed451508a42d1d7d ("kernfs: do not account ino_ida allocations
to memcg").

Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
fragile and difficult to maintain, because there seem to be many more
allocations that should not be accounted than those that should be.
Besides, false accounting an allocation might result in much worse
consequences than not accounting at all, namely increased memory
consumption due to pinned dead kmem caches.

So it was decided to switch to the white-list policy. This patch reverts
bits introducing the black-list policy. The white-list policy will be
introduced later in the series.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bd96f76a 20-Nov-2015 Tejun Heo <tj@kernel.org>

kernfs: implement kernfs_walk_and_get()

Implement kernfs_walk_and_get() which is similar to
kernfs_find_and_get() but can walk a path instead of just a name.

v2: Use strlcpy() instead of strlen() + memcpy() as suggested by
David.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: David Miller <davem@davemloft.net>


# 9acee9c5 18-Aug-2015 Tejun Heo <tj@kernel.org>

kernfs: implement kernfs_path_len()

Add a function to determine the path length of a kernfs node. This
for now will be used by writeback tracepoint updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# ea015218 13-May-2015 Eric W. Biederman <ebiederm@xmission.com>

kernfs: Add support for always empty directories.

Add a new function kernfs_create_empty_dir that can be used to create
directory that can not be modified.

Update the code to use make_empty_dir_inode when reporting a
permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 499611ed 14-May-2015 Vladimir Davydov <vdavydov.dev@gmail.com>

kernfs: do not account ino_ida allocations to memcg

root->ino_ida is used for kernfs inode number allocations. Since IDA has
a layered structure, different IDs can reside on the same layer, which
is currently accounted to some memory cgroup. The problem is that each
kmem cache of a memory cgroup has its own directory on sysfs (under
/sys/fs/kernel/<cache-name>/cgroup). If the inode number of such a
directory or any file in it gets allocated from a layer accounted to the
cgroup which the cache is created for, the cgroup will get pinned for
good, because one has to free all kmem allocations accounted to a cgroup
in order to release it and destroy all its kmem caches. That said we
must not account layers of ino_ida to any memory cgroup.

Since per net init operations may create new sysfs entries directly
(e.g. lo device) or indirectly (nf_conntrack creates a new kmem cache
per each namespace, which, in turn, creates new sysfs entries), an easy
way to reproduce this issue is by creating network namespace(s) from
inside a kmem-active memory cgroup.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org> [4.0.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2b0143b5 17-Mar-2015 David Howells <dhowells@redhat.com>

VFS: normal filesystems (and lustre): d_inode() annotations

that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# dfeb0750 13-Feb-2015 Tejun Heo <tj@kernel.org>

kernfs: remove KERNFS_STATIC_NAME

When a new kernfs node is created, KERNFS_STATIC_NAME is used to avoid
making a separate copy of its name. It's currently only used for sysfs
attributes whose filenames are required to stay accessible and unchanged.
There are rare exceptions where these names are allocated and formatted
dynamically but for the vast majority of cases they're consts in the
rodata section.

Now that kernfs is converted to use kstrdup_const() and kfree_const(),
there's little point in keeping KERNFS_STATIC_NAME around. Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 75287a67 13-Feb-2015 Andrzej Hajda <a.hajda@samsung.com>

kernfs: convert node name allocation to kstrdup_const

sysfs frequently performs duplication of strings located in read-only
memory section. Replacing kstrdup by kstrdup_const allows to avoid such
operations.

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mike Turquette <mturquette@linaro.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Greg KH <greg@kroah.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 72392ed0 05-Dec-2014 Rasmus Villemoes <linux@rasmusvillemoes.dk>

kernfs: Fix kernfs_name_compare

Returning a difference from a comparison functions is usually wrong
(see acbbe6fbb240 "kcmp: fix standard comparison bug" for the long
story). Here there is the additional twist that if the void pointers
ns and kn->ns happen to differ by a multiple of 2^32,
kernfs_name_compare returns 0, falsely reporting a match to the
caller.

Technically 'hash - kn->hash' is ok since the hashes are restricted to
31 bits, but it's better to avoid that subtlety.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 41d28bca 12-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

switch d_materialise_unique() users to d_splice_alias()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9b053f32 13-Feb-2014 Eric W. Biederman <ebiederm@xmission.com>

vfs: Remove unnecessary calls of check_submounts_and_drop

Now that check_submounts_and_drop can not fail and is called from
d_invalidate there is no longer a need to call check_submounts_and_drom
from filesystem d_revalidate methods so remove it.

Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c1befb88 17-Apr-2014 Jianyu Zhan <nasa4836@gmail.com>

kernfs: fix a subdir count leak

Currently kernfs_link_sibling() increates parent->dir.subdirs before
adding the node into parent's chidren rb tree.

Because it is possible that kernfs_link_sibling() couldn't find
a suitable slot and bail out, this leads to a mismatch between
elevated subdir count with actual children node numbers.

This patches fix this problem, by moving the subdir accouting
after the actual addtion happening.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 7d568a83 09-Apr-2014 Tejun Heo <tj@kernel.org>

kernfs: implement kernfs_root->supers list

Currently, there's no way to find out which super_blocks are
associated with a given kernfs_root. Let's implement it - the planned
inotify extension to kernfs_notify() needs it.

Make kernfs_super_info point back to the super_block and chain it at
kernfs_root->supers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 88391d49 05-Mar-2014 Richard Cochran <richardcochran@gmail.com>

kernfs: fix off by one error.

The hash values 0 and 1 are reserved for magic directory entries, but
the code only prevents names hashing to 0. This patch fixes the test
to also prevent hash value 1.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# f41c5934 14-Feb-2014 Li Zefan <lizefan@huawei.com>

kernfs: fix kernfs_node_from_dentry()

Currently kernfs_node_from_dentry() returns NULL for root dentry,
because root_dentry->d_op == NULL.

Due to this bug cgroupstats_build() returns -EINVAL for root cgroup.

# mount -t cgroup -o cpuacct /cgroup
# Documentation/accounting/getdelays -C /cgroup
fatal reply error, errno -22

With this fix:

# Documentation/accounting/getdelays -C /cgroup
sleeping 305, blocked 0, running 1, stopped 0, uninterruptible 1

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# e61734c5 12-Feb-2014 Tejun Heo <tj@kernel.org>

cgroup: remove cgroup->name

cgroup->name handling became quite complicated over time involving
dedicated struct cgroup_name for RCU protection. Now that cgroup is
on kernfs, we can drop all of it and simply use kernfs_name/path() and
friends. Replace cgroup->name and all related code with kernfs
name/path constructs.

* Reimplement cgroup_name() and cgroup_path() as thin wrappers on top
of kernfs counterparts, which involves semantic changes.
pr_cont_cgroup_name() and pr_cont_cgroup_path() added.

* cgroup->name handling dropped from cgroup_rename().

* All users of cgroup_name/path() updated to the new semantics. Users
which were formatting the string just to printk them are converted
to use pr_cont_cgroup_name/path() instead, which simplifies things
quite a bit. As cgroup_name() no longer requires RCU read lock
around it, RCU lockings which were protecting only cgroup_name() are
removed.

v2: Comment above oom_info_lock updated as suggested by Michal.

v3: dummy_top doesn't have a kn associated and
pr_cont_cgroup_name/path() ended up calling the matching kernfs
functions with NULL kn leading to oops. Test for NULL kn and
print "/" if so. This issue was reported by Fengguang Wu.

v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


# 9561a896 10-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: fix hash calculation in kernfs_rename_ns()

3eef34ad7dc3 ("kernfs: implement kernfs_get_parent(),
kernfs_name/path() and friends") restructured kernfs_rename_ns() such
that new name assignment happens under kernfs_rename_lock;
unfortunately, it mistakenly passed NULL to kernfs_name_hash() to
calculate the new hash if the name hasn't changed, which can lead to
oops.

Fix it by using kn->name and kn->ns when calculating the new hash.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Carpenter dan.carpenter@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 3eef34ad 07-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: implement kernfs_get_parent(), kernfs_name/path() and friends

kernfs_node->parent and ->name are currently marked as "published"
indicating that kernfs users may access them directly; however, those
fields may get updated by kernfs_rename[_ns]() and unrestricted access
may lead to erroneous values or oops.

Protect ->parent and ->name updates with a irq-safe spinlock
kernfs_rename_lock and implement the following accessors for these
fields.

* kernfs_name() - format the node's name into the specified buffer
* kernfs_path() - format the node's path into the specified buffer
* pr_cont_kernfs_name() - pr_cont a node's name (doesn't need buffer)
* pr_cont_kernfs_path() - pr_cont a node's path (doesn't need buffer)
* kernfs_get_parent() - pin and return a node's parent

All can be called under any context. The recursive sysfs_pathname()
in fs/sysfs/dir.c is replaced with kernfs_path() and
sysfs_rename_dir_ns() is updated to use kernfs_get_parent() instead of
dereferencing parent directly.

v2: Dummy definition of kernfs_path() for !CONFIG_KERNFS was missing
static inline making it cause a lot of build warnings. Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 0c23b225 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: implement kernfs_node_from_dentry(), kernfs_root_from_sb() and kernfs_rename()

Implement helpers to determine node from dentry and root from
super_block. Also add a kernfs_rename_ns() wrapper which assumes NULL
namespace. These generally make sense and will be used by cgroup.

v2: Some dummy implementations for !CONFIG_SYSFS was missing. Fixed.
Reported by kbuild test robot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# d35258ef 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: allow nodes to be created in the deactivated state

Currently, kernfs_nodes are made visible to userland on creation,
which makes it difficult for kernfs users to atomically succeed or
fail creation of multiple nodes. In addition, if something fails
after creating some nodes, the created nodes might already be in use
and their active refs need to be drained for removal, which has the
potential to introduce tricky reverse locking dependency on active_ref
depending on how the error path is synchronized.

This patch introduces per-root flag KERNFS_ROOT_CREATE_DEACTIVATED.
If set, all nodes under the root are created in the deactivated state
and stay invisible to userland until explicitly enabled by the new
kernfs_activate() API. Also, nodes which have never been activated
are guaranteed to bypass draining on removal thus allowing error paths
to not worry about lockding dependency on active_ref draining.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# b9c9dad0 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: add missing kernfs_active() checks in directory operations

kernfs_iop_lookup(), kernfs_dir_pos() and kernfs_dir_next_pos() were
missing kernfs_active() tests before using the found kernfs_node. As
deactivated state is currently visible only while a node is being
removed, this doesn't pose an actual problem. e.g. lookup succeeding
on a deactivated node doesn't harm anything as the eventual file
operations are gonna fail and those failures are indistinguishible
from the cases in which the lookups had happened before the node was
deactivated.

However, we're gonna allow new nodes to be created deactivated and
then activated explicitly by the kernfs user when it sees fit. This
is to support atomically making multiple nodes visible to userland and
thus those nodes must not be visible to userland before activated.

Let's plug the lookup and readdir holes so that deactivated nodes are
invisible to userland.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 90c07c89 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: rename kernfs_dir_ops to kernfs_syscall_ops

We're gonna need non-dir syscall callbacks, which will make dir_ops a
misnomer. Let's rename kernfs_dir_ops to kernfs_syscall_ops.

This is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 07c7530d 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: invoke dir_ops while holding active ref of the target node

kernfs_dir_ops are currently being invoked without any active
reference, which makes it tricky for the invoked operations to
determine whether the objects associated those nodes are safe to
access and will remain that way for the duration of such operations.

kernfs already has active_ref mechanism to deal with this which makes
the removal of a given node the synchronization point for gating the
file operations. There's no reason for dir_ops to be any different.
Update the dir_ops handling so that active_ref is held while the
dir_ops are executing. This guarantees that while a dir_ops is
executing the target nodes stay alive.

As kernfs_dir_ops doesn't have any in-kernel user at this point, this
doesn't affect anybody.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 6b0afc2a 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers

Sometimes it's necessary to implement a node which wants to delete
nodes including itself. This isn't straightforward because of kernfs
active reference. While a file operation is in progress, an active
reference is held and kernfs_remove() waits for all such references to
drain before completing. For a self-deleting node, this is a deadlock
as kernfs_remove() ends up waiting for an active reference that itself
is sitting on top of.

This currently is worked around in the sysfs layer using
sysfs_schedule_callback() which makes such removals asynchronous.
While it works, it's rather cumbersome and inherently breaks
synchronicity of the operation - the file operation which triggered
the operation may complete before the removal is finished (or even
started) and the removal may fail asynchronously. If a removal
operation is immmediately followed by another operation which expects
the specific name to be available (e.g. removal followed by rename
onto the same name), there's no way to make the latter operation
reliable.

The thing is there's no inherent reason for this to be asynchrnous.
All that's necessary to do this synchronous is a dedicated operation
which drops its own active ref and deactivates self. This patch
implements kernfs_remove_self() and its wrappers in sysfs and driver
core. kernfs_remove_self() is to be called from one of the file
operations, drops the active ref the task is holding, removes the self
node, and restores active ref to the dead node so that the ref is
balanced afterwards. __kernfs_remove() is updated so that it takes an
early exit if the target node is already fully removed so that the
active ref restored by kernfs_remove_self() after removal doesn't
confuse the deactivation path.

This makes implementing self-deleting nodes very easy. The normal
removal path doesn't even need to be changed to use
kernfs_remove_self() for the self-deleting node. The method can
invoke kernfs_remove_self() on itself before proceeding the normal
removal path. kernfs_remove() invoked on the node by the normal
deletion path will simply be ignored.

This will replace sysfs_schedule_callback(). A subtle feature of
sysfs_schedule_callback() is that it collapses multiple invocations -
even if multiple removals are triggered, the removal callback is run
only once. An equivalent effect can be achieved by testing the return
value of kernfs_remove_self() - only the one which gets %true return
value should proceed with actual deletion. All other instances of
kernfs_remove_self() will wait till the enclosing kernfs operation
which invoked the winning instance of kernfs_remove_self() finishes
and then return %false. This trivially makes all users of
kernfs_remove_self() automatically show correct synchronous behavior
even when there are multiple concurrent operations - all "echo 1 >
delete" instances will finish only after the whole operation is
completed by one of the instances.

Note that manipulation of active ref is implemented in separate public
functions - kernfs_[un]break_active_protection().
kernfs_remove_self() is the only user at the moment but this will be
used to cater to more complex cases.

v2: For !CONFIG_SYSFS, dummy version kernfs_remove_self() was missing
and sysfs_remove_file_self() had incorrect return type. Fix it.
Reported by kbuild test bot.

v3: kernfs_[un]break_active_protection() separated out from
kernfs_remove_self() and exposed as public API.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 81c173cb 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: remove KERNFS_REMOVED

KERNFS_REMOVED is used to mark half-initialized and dying nodes so
that they don't show up in lookups and deny adding new nodes under or
renaming it; however, its role overlaps that of deactivation.

It's necessary to deny addition of new children while removal is in
progress; however, this role considerably intersects with deactivation
- KERNFS_REMOVED prevents new children while deactivation prevents new
file operations. There's no reason to have them separate making
things more complex than necessary.

This patch removes KERNFS_REMOVED.

* Instead of KERNFS_REMOVED, each node now starts its life
deactivated. This means that we now use both atomic_add() and
atomic_sub() on KN_DEACTIVATED_BIAS, which is INT_MIN. The compiler
generates an overflow warnings when negating INT_MIN as the negation
can't be represented as a positive number. Nothing is actually
broken but let's bump BIAS by one to avoid the warnings for archs
which negates the subtrahend..

* A new helper kernfs_active() which tests whether kn->active >= 0 is
added for convenience and lockdep annotation. All KERNFS_REMOVED
tests are replaced with negated kernfs_active() tests.

* __kernfs_remove() is updated to deactivate, but not drain, all nodes
in the subtree instead of setting KERNFS_REMOVED. This removes
deactivation from kernfs_deactivate(), which is now renamed to
kernfs_drain().

* Sanity check on KERNFS_REMOVED in kernfs_put() is replaced with
checks on the active ref.

* Some comment style updates in the affected area.

v2: Reordered before removal path restructuring. kernfs_active()
dropped and kernfs_get/put_active() used instead. RB_EMPTY_NODE()
used in the lookup paths.

v3: Reverted most of v2 except for creating a new node with
KN_DEACTIVATED_BIAS.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 182fd64b 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()

There currently are two mechanisms gating active ref lockdep
annotations - KERNFS_LOCKDEP flag and KERNFS_ACTIVE_REF type mask.
The former disables lockdep annotations in kernfs_get/put_active()
while the latter disables all of kernfs_deactivate().

While KERNFS_ACTIVE_REF also behaves as an optimization to skip the
deactivation step for non-file nodes, the benefit is marginal and it
needlessly diverges code paths. Let's drop KERNFS_ACTIVE_REF.

While at it, add a test helper kernfs_lockdep() to test KERNFS_LOCKDEP
flag so that it's more convenient and the related code can be compiled
out when not enabled.

v2: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
KERNFS_LOCKDEP flag"). As the earlier patch already added
KERNFS_LOCKDEP tests to kernfs_deactivate(), those additions are
dropped from this patch and the existing ones are simply converted
to kernfs_lockdep().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 988cd7af 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: remove kernfs_addrm_cxt

kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were
added because there were operations which should be performed outside
kernfs_mutex after adding and removing kernfs_nodes. The necessary
operations were recorded in kernfs_addrm_cxt and performed by
kernfs_addrm_finish(); however, after the recent changes which
relocated deactivation and unmapping so that they're performed
directly during removal, the only operation kernfs_addrm_finish()
performs is kernfs_put(), which can be moved inside the removal path
too.

This patch moves the kernfs_put() of the base ref to __kernfs_remove()
and remove kernfs_addrm_cxt and kernfs_addrm_start/finish().

* kernfs_add_one() is updated to grab and release kernfs_mutex itself.
sysfs_addrm_start/finish() invocations around it are removed from
all users.

* __kernfs_remove() puts an unlinked node directly instead of chaining
it to kernfs_addrm_cxt. Its callers are updated to grab and release
kernfs_mutex instead of calling kernfs_addrm_start/finish() around
it.

v2: Rebased on top of "kernfs: associate a new kernfs_node with its
parent on creation" which dropped @parent from kernfs_add_one().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ccf02aaf 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: invoke kernfs_unmap_bin_file() directly from kernfs_deactivate()

kernfs_unmap_bin_file() is supposed to unmap all memory mappings of
the target file before kernfs_remove() finishes; however, it currently
is being called from kernfs_addrm_finish() and has the same race
problem as the original implementation of deactivation when there are
multiple removers - only the remover which snatches the node to its
addrm_cxt->removed list is guaranteed to wait for its completion
before returning.

It can be easily fixed by moving kernfs_unmap_bin_file() invocation
from kernfs_addrm_finish() to kernfs_deactivated(). The function may
be called multiple times but that shouldn't do any harm.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 35beab06 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: restructure removal path to fix possible premature return

The recursive nature of kernfs_remove() means that, even if
kernfs_remove() is not allowed to be called multiple times on the same
node, there may be race conditions between removal of parent and its
descendants. While we can claim that kernfs_remove() shouldn't be
called on one of the descendants while the removal of an ancestor is
in progress, such rule is unnecessarily restrictive and very difficult
to enforce. It's better to simply allow invoking kernfs_remove() as
the caller sees fit as long as the caller ensures that the node is
accessible.

The current behavior in such situations is broken. Whoever enters
removal path first takes the node off the hierarchy and then
deactivates. Following removers either return as soon as it notices
that it's not the first one or can't even find the target node as it
has already been removed from the hierarchy. In both cases, the
following removers may finish prematurely while the nodes which should
be removed and drained are still being processed by the first one.

This patch restructures so that multiple removers, whether through
recursion or direction invocation, always follow the following rules.

* When there are multiple concurrent removers, only one puts the base
ref.

* Regardless of which one puts the base ref, all removers are blocked
until the target node is fully deactivated and removed.

To achieve the above, removal path now first marks all descendants
including self REMOVED and then deactivates and unlinks leftmost
descendant one-by-one. kernfs_deactivate() is called directly from
__kernfs_removal() and drops and regrabs kernfs_mutex for each
descendant to drain active refs. As this means that multiple removers
can enter kernfs_deactivate() for the same node, the function is
updated so that it can handle multiple deactivators of the same node -
only one actually deactivates but all wait till drain completion.

The restructured removal path guarantees that a removed node gets
unlinked only after the node is deactivated and drained. Combined
with proper multiple deactivator handling, this guarantees that any
invocation of kernfs_remove() returns only after the node itself and
all its descendants are deactivated, drained and removed.

v2: Draining separated into a separate loop (used to be in the same
loop as unlink) and done from __kernfs_deactivate(). This is to
allow exposing deactivation as a separate interface later.

Root node removal was broken in v1 patch. Fixed.

v3: Revert most of v2 except for root node removal fix and
simplification of KERNFS_REMOVED setting loop.

v4: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
KERNFS_LOCKDEP flag").

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# abd54f02 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq

kernfs_node->u.completion is used to notify deactivation completion
from kernfs_put_active() to kernfs_deactivate(). We now allow
multiple racing removals of the same node and the current removal
scheme is no longer correct - kernfs_remove() invocation may return
before the node is properly deactivated if it races against another
removal. The removal path will be restructured to address the issue.

To help such restructure which requires supporting multiple waiters,
this patch replaces kernfs_node->u.completion with
kernfs_root->deactivate_waitq. This makes deactivation event
notifications share a per-root waitqueue_head; however, the wait path
is quite cold and this will also allow shaving one pointer off
kernfs_node.

v2: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
KERNFS_LOCKDEP flag").

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# a6607930 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: make kernfs_deactivate() honor KERNFS_LOCKDEP flag

kernfs_deactivate() forgot to check whether KERNFS_LOCKDEP is set
before performing lockdep annotations and ends up feeding
uninitialized lockdep_map to lockdep triggering warning like the
following on USB stick hotunplug.

usb 1-2: USB disconnect, device number 2
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 1 PID: 62 Comm: khubd Not tainted 3.13.0-work+ #82
Hardware name: empty empty/S3992, BIOS 080011 10/26/2007
ffff880065ca7f60 ffff88013a4ffa08 ffffffff81cfb6bd 0000000000000002
ffff88013a4ffac8 ffffffff810f8530 ffff88013a4fc710 0000000000000002
ffff880100000000 ffffffff82a3db50 0000000000000001 ffff88013a4fc710
Call Trace:
[<ffffffff81cfb6bd>] dump_stack+0x4e/0x7a
[<ffffffff810f8530>] __lock_acquire+0x1910/0x1e70
[<ffffffff810f931a>] lock_acquire+0x9a/0x1d0
[<ffffffff8127c75e>] kernfs_deactivate+0xee/0x130
[<ffffffff8127d4c8>] kernfs_addrm_finish+0x38/0x60
[<ffffffff8127d701>] kernfs_remove_by_name_ns+0x51/0xa0
[<ffffffff8127b4f1>] remove_files.isra.1+0x41/0x80
[<ffffffff8127b7e7>] sysfs_remove_group+0x47/0xa0
[<ffffffff8127b873>] sysfs_remove_groups+0x33/0x50
[<ffffffff8177d66d>] device_remove_attrs+0x4d/0x80
[<ffffffff8177e25e>] device_del+0x12e/0x1d0
[<ffffffff819722c2>] usb_disconnect+0x122/0x1a0
[<ffffffff819749b5>] hub_thread+0x3c5/0x1290
[<ffffffff810c6a6d>] kthread+0xed/0x110
[<ffffffff81d0a56c>] ret_from_fork+0x7c/0xb0

Fix it by making kernfs_deactivate() perform lockdep annotations only
if KERNFS_LOCKDEP is set.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fabio Estevam <festevam@gmail.com>
Reported-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# da9846ae 28-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: make kernfs_deactivate() honor KERNFS_LOCKDEP flag

kernfs_deactivate() forgot to check whether KERNFS_LOCKDEP is set
before performing lockdep annotations and ends up feeding
uninitialized lockdep_map to lockdep triggering warning like the
following on USB stick hotunplug.

usb 1-2: USB disconnect, device number 2
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 1 PID: 62 Comm: khubd Not tainted 3.13.0-work+ #82
Hardware name: empty empty/S3992, BIOS 080011 10/26/2007
ffff880065ca7f60 ffff88013a4ffa08 ffffffff81cfb6bd 0000000000000002
ffff88013a4ffac8 ffffffff810f8530 ffff88013a4fc710 0000000000000002
ffff880100000000 ffffffff82a3db50 0000000000000001 ffff88013a4fc710
Call Trace:
[<ffffffff81cfb6bd>] dump_stack+0x4e/0x7a
[<ffffffff810f8530>] __lock_acquire+0x1910/0x1e70
[<ffffffff810f931a>] lock_acquire+0x9a/0x1d0
[<ffffffff8127c75e>] kernfs_deactivate+0xee/0x130
[<ffffffff8127d4c8>] kernfs_addrm_finish+0x38/0x60
[<ffffffff8127d701>] kernfs_remove_by_name_ns+0x51/0xa0
[<ffffffff8127b4f1>] remove_files.isra.1+0x41/0x80
[<ffffffff8127b7e7>] sysfs_remove_group+0x47/0xa0
[<ffffffff8127b873>] sysfs_remove_groups+0x33/0x50
[<ffffffff8177d66d>] device_remove_attrs+0x4d/0x80
[<ffffffff8177e25e>] device_del+0x12e/0x1d0
[<ffffffff819722c2>] usb_disconnect+0x122/0x1a0
[<ffffffff819749b5>] hub_thread+0x3c5/0x1290
[<ffffffff810c6a6d>] kthread+0xed/0x110
[<ffffffff81d0a56c>] ret_from_fork+0x7c/0xb0

Fix it by making kernfs_deactivate() perform lockdep annotations only
if KERNFS_LOCKDEP is set.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fabio Estevam <festevam@gmail.com>
Reported-by: Alan Stern <stern@rowland.harvard.edu>
Reported-by: Jiri Kosina <jkosina@suse.cz>
Reported-by: Dave Jones <davej@redhat.com>
Tested-by: Fabio Estevam <fabio.estevam@freescale.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# db4aad20 17-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: associate a new kernfs_node with its parent on creation

Once created, a kernfs_node is always destroyed by kernfs_put().
Since ba7443bc656e ("sysfs, kernfs: implement
kernfs_create/destroy_root()"), kernfs_put() depends on kernfs_root()
to locate the ino_ida. kernfs_root() in turn depends on
kernfs_node->parent being set for !dir nodes. This means that
kernfs_put() of a !dir node requires its ->parent to be initialized.

This leads to oops when a newly created !dir node is destroyed without
going through kernfs_add_one() or after failing kernfs_add_one()
before ->parent is set. kernfs_root() invoked from kernfs_put() will
try to dereference NULL parent.

Fix it by moving parent association to kernfs_new_node() from
kernfs_add_one(). kernfs_new_node() now takes @parent instead of
@root and determines the root from the parent and also sets the new
node's parent properly. @parent parameter is removed from
kernfs_add_one(). As there's no parent when creating the root node,
__kernfs_new_node() which takes @root as before and doesn't set the
parent is used in that case.

This ensures that a kernfs_node in any stage in its life has its
parent associated and thus can be put.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 87da1493 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq"

This reverts commit ea1c472dfeada211a0100daa7976e8e8e779b858.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 0890147f 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()"

This reverts commit a69d001cfc712b96ec9d7ba44d6285702a38dabf.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 798c75a0 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: remove KERNFS_REMOVED"

This reverts commit ae34372eb8408b3d07e870f1939f99007a730d28.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 4f4b1b64 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: restructure removal path to fix possible premature return"

This reverts commit 45a140e587f3d32d8d424ed940dffb61e1739047.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 55f6e30d 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()"

This reverts commit f601f9a2bf7dc1f7ee18feece4c4e2fc6845d6c4.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 7653fe9d 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: remove kernfs_addrm_cxt"

This reverts commit 99177a34110889a8f2c36420c34e3bcc9bfd8a70.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# f4b3e631 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: make kernfs_get_active() block if the node is deactivated but not removed"

This reverts commit 895a068a524e134900b9d98b519309b7aae7bbb1.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 9b0925a6 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: implement kernfs_{de|re}activate[_self]()"

This reverts commit 9f010c2ad5194a4b682e747984477850fabd03be.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# a9f138b0 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers"

This reverts commit 1ae06819c77cff1ea2833c94f8c093fe8a5c79db.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ce9b499c 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()"

This reverts commit 88533f990c616cf50c2fe585ea03f75c806a293d.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 88533f99 11-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: remove unnecessary NULL check in __kernfs_remove()

895a068a524e ("kernfs: make kernfs_get_active() block if the node is
deactivated but not removed") added "struct kernfs_root *root =
kernfs_root(kn);" at the head of the function; however, the parameter
@kn is checked for later implying that the function may be called with
NULL. This means that we may end up invoking kernfs_root() with NULL
which will oops. None of the existing users invokes removal with NULL
@kn, so this bug doesn't actually trigger.

We can relocate kernfs_root() invocation after NULL check; however,
allowing NULL param tends to cause more confusion than actually
helping anything. As there's no existing user, let's remove the
spurious NULL check.

This bug was detected by smatch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 1ae06819 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers

Sometimes it's necessary to implement a node which wants to delete
nodes including itself. This isn't straightforward because of kernfs
active reference. While a file operation is in progress, an active
reference is held and kernfs_remove() waits for all such references to
drain before completing. For a self-deleting node, this is a deadlock
as kernfs_remove() ends up waiting for an active reference that itself
is sitting on top of.

This currently is worked around in the sysfs layer using
sysfs_schedule_callback() which makes such removals asynchronous.
While it works, it's rather cumbersome and inherently breaks
synchronicity of the operation - the file operation which triggered
the operation may complete before the removal is finished (or even
started) and the removal may fail asynchronously. If a removal
operation is immmediately followed by another operation which expects
the specific name to be available (e.g. removal followed by rename
onto the same name), there's no way to make the latter operation
reliable.

The thing is there's no inherent reason for this to be asynchrnous.
All that's necessary to do this synchronous is a dedicated operation
which drops its own active ref and deactivates self. This patch
implements kernfs_remove_self() and its wrappers in sysfs and driver
core. kernfs_remove_self() is to be called from one of the file
operations, drops the active ref and deactivates using
__kernfs_deactivate_self(), removes the self node, and restores active
ref to the dead node using __kernfs_reactivate_self() so that the ref
is balanced afterwards. __kernfs_remove() is updated so that it takes
an early exit if the target node is already fully removed so that the
active ref restored by kernfs_remove_self() after removal doesn't
confuse the deactivation path.

This makes implementing self-deleting nodes very easy. The normal
removal path doesn't even need to be changed to use
kernfs_remove_self() for the self-deleting node. The method can
invoke kernfs_remove_self() on itself before proceeding the normal
removal path. kernfs_remove() invoked on the node by the normal
deletion path will simply be ignored.

This will replace sysfs_schedule_callback(). A subtle feature of
sysfs_schedule_callback() is that it collapses multiple invocations -
even if multiple removals are triggered, the removal callback is run
only once. An equivalent effect can be achieved by testing the return
value of kernfs_remove_self() - only the one which gets %true return
value should proceed with actual deletion. All other instances of
kernfs_remove_self() will wait till the enclosing kernfs operation
which invoked the winning instance of kernfs_remove_self() finishes
and then return %false. This trivially makes all users of
kernfs_remove_self() automatically show correct synchronous behavior
even when there are multiple concurrent operations - all "echo 1 >
delete" instances will finish only after the whole operation is
completed by one of the instances.

v2: For !CONFIG_SYSFS, dummy version kernfs_remove_self() was missing
and sysfs_remove_file_self() had incorrect return type. Fix it.
Reported by kbuild test bot.

v3: Updated to use __kernfs_{de|re}activate_self().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 9f010c2a 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: implement kernfs_{de|re}activate[_self]()

This patch implements four functions to manipulate deactivation state
- deactivate, reactivate and the _self suffixed pair. A new fields
kernfs_node->deact_depth is added so that concurrent and nested
deactivations are handled properly. kernfs_node->hash is moved so
that it's paired with the new field so that it doesn't increase the
size of kernfs_node.

A kernfs user's lock would normally nest inside active ref but during
removal the user may want to perform kernfs_remove() while holding the
said lock, which would introduce a reverse locking dependency. This
function can be used to break such reverse dependency by allowing
deactivation step to performed separately outside user's critical
section.

This will also be used implement kernfs_remove_self().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 895a068a 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: make kernfs_get_active() block if the node is deactivated but not removed

Currently, kernfs_get_active() fails if the target node is
deactivated. This is fine as a node always gets removed after
deactivation; however, we're gonna add reactivation so the assumption
won't hold. It'd be incorrect for kernfs_get_active() to fail for a
node which was deactivated only temporarily.

This patch makes kernfs_get_active() block if the node is deactivated
but not removed. If the node gets reactivated (not yet implemented),
it will be retried and succeed. If the node gets removed, it will be
woken up and fail.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 99177a34 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: remove kernfs_addrm_cxt

kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were
added because there were operations which should be performed outside
kernfs_mutex after adding and removing kernfs_nodes. The necessary
operations were recorded in kernfs_addrm_cxt and performed by
kernfs_addrm_finish(); however, after the recent changes which
relocated deactivation and unmapping so that they're performed
directly during removal, the only operation kernfs_addrm_finish()
performs is kernfs_put(), which can be moved inside the removal path
too.

This patch moves the kernfs_put() of the base ref to __kernfs_remove()
and remove kernfs_addrm_cxt and kernfs_addrm_start/finish().

* kernfs_add_one() is updated to grab and release the parent's active
ref and kernfs_mutex itself. kernfs_get/put_active() and
kernfs_addrm_start/finish() invocations around it are removed from
all users.

* __kernfs_remove() puts an unlinked node directly instead of chaining
it to kernfs_addrm_cxt. Its callers are updated to grab and release
kernfs_mutex instead of calling kernfs_addrm_start/finish() around
it.

v2: Updated to fit the v2 restructuring of removal path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# f601f9a2 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()

kernfs_unmap_bin_file() is supposed to unmap all memory mappings of
the target file before kernfs_remove() finishes; however, it currently
is being called from kernfs_addrm_finish() and has the same race
problem as the original implementation of deactivation when there are
multiple removers - only the remover which snatches the node to its
addrm_cxt->removed list is guaranteed to wait for its completion
before returning.

It can be fixed by moving kernfs_unmap_bin_file() invocation from
kernfs_addrm_finish() to __kernfs_remove(). The function may be
called multiple times but that shouldn't do any harm.

We end up dropping kernfs_mutex in the removal loop and the node may
be removed inbetween by someone else. kernfs_unlink_sibling() is
updated to test whether the node has already been removed and return
accordingly. __kernfs_remove() in turn performs post-unlinking
cleanup only if it actually unlinked the node.

KERNFS_HAS_MMAP test is moved out of the unmap function into
__kernfs_remove() so that we don't unlock kernfs_mutex unnecessarily.
While at it, drop the now meaningless "bin" qualifier from the
function name.

v2: Rewritten to fit the v2 restructuring of removal path. HAS_MMAP
test relocated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 45a140e5 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: restructure removal path to fix possible premature return

The recursive nature of kernfs_remove() means that, even if
kernfs_remove() is not allowed to be called multiple times on the same
node, there may be race conditions between removal of parent and its
descendants. While we can claim that kernfs_remove() shouldn't be
called on one of the descendants while the removal of an ancestor is
in progress, such rule is unnecessarily restrictive and very difficult
to enforce. It's better to simply allow invoking kernfs_remove() as
the caller sees fit as long as the caller ensures that the node is
accessible.

The current behavior in such situations is broken. Whoever enters
removal path first takes the node off the hierarchy and then
deactivates. Following removers either return as soon as it notices
that it's not the first one or can't even find the target node as it
has already been removed from the hierarchy. In both cases, the
following removers may finish prematurely while the nodes which should
be removed and drained are still being processed by the first one.

This patch restructures so that multiple removers, whether through
recursion or direction invocation, always follow the following rules.

* When there are multiple concurrent removers, only one puts the base
ref.

* Regardless of which one puts the base ref, all removers are blocked
until the target node is fully deactivated and removed.

To achieve the above, removal path now first deactivates the subtree,
drains it and then unlinks one-by-one. __kernfs_deactivate() is
called directly from __kernfs_removal() and drops and regrabs
kernfs_mutex for each descendant to drain active refs. As this means
that multiple removers can enter __kernfs_deactivate() for the same
node, the function is updated so that it can handle multiple
deactivators of the same node - only one actually deactivates but all
wait till drain completion.

The restructured removal path guarantees that a removed node gets
unlinked only after the node is deactivated and drained. Combined
with proper multiple deactivator handling, this guarantees that any
invocation of kernfs_remove() returns only after the node itself and
all its descendants are deactivated, drained and removed.

v2: Draining separated into a separate loop (used to be in the same
loop as unlink) and done from __kernfs_deactivate(). This is to
allow exposing deactivation as a separate interface later.

Root node removal was broken in v1 patch. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ae34372e 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: remove KERNFS_REMOVED

KERNFS_REMOVED is used to mark half-initialized and dying nodes so
that they don't show up in lookups and deny adding new nodes under or
renaming it; however, its role overlaps those of deactivation and
removal from rbtree.

It's necessary to deny addition of new children while removal is in
progress; however, this role considerably intersects with deactivation
- KERNFS_REMOVED prevents new children while deactivation prevents new
file operations. There's no reason to have them separate making
things more complex than necessary.

KERNFS_REMOVED is also used to decide whether a node is still visible
to vfs layer, which is rather redundant as equivalent determination
can be made by testing whether the node is on its parent's children
rbtree or not.

This patch removes KERNFS_REMOVED.

* Instead of KERNFS_REMOVED, each node now starts its life
deactivated. This means that we now use both atomic_add() and
atomic_sub() on KN_DEACTIVATED_BIAS, which is INT_MIN. The compiler
generates an overflow warnings when negating INT_MIN as the negation
can't be represented as a positive number. Nothing is actually
broken but let's bump BIAS by one to avoid the warnings for archs
which negates the subtrahend..

* KERNFS_REMOVED tests in add and rename paths are replaced with
kernfs_get/put_active() of the target nodes. Due to the way the add
path is structured now, active ref handling is done in the callers
of kernfs_add_one(). This will be consolidated up later.

* kernfs_remove_one() is updated to deactivate instead of setting
KERNFS_REMOVED. This removes deactivation from kernfs_deactivate(),
which is now renamed to kernfs_drain().

* kernfs_dop_revalidate() now tests RB_EMPTY_NODE(&kn->rb) instead of
KERNFS_REMOVED and KERNFS_REMOVED test in kernfs_dir_pos() is
dropped. A node which is removed from the children rbtree is not
included in the iteration in the first place. This means that a
node may be visible through vfs a bit longer - it's now also visible
after deactivation until the actual removal. This slightly enlarged
window difference doesn't make any difference to the userland.

* Sanity check on KERNFS_REMOVED in kernfs_put() is replaced with
checks on the active ref.

* Some comment style updates in the affected area.

v2: Reordered before removal path restructuring. kernfs_active()
dropped and kernfs_get/put_active() used instead. RB_EMPTY_NODE()
used in the lookup paths.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# a69d001c 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()

There currently are two mechanisms gating active ref lockdep
annotations - KERNFS_LOCKDEP flag and KERNFS_ACTIVE_REF type mask.
The former disables lockdep annotations in kernfs_get/put_active()
while the latter disables all of kernfs_deactivate().

While KERNFS_ACTIVE_REF also behaves as an optimization to skip the
deactivation step for non-file nodes, the benefit is marginal and it
needlessly diverges code paths. Let's drop KERNFS_ACTIVE_REF and use
KERNFS_LOCKDEP in kernfs_deactivate() too.

While at it, add a test helper kernfs_lockdep() to test KERNFS_LOCKDEP
flag so that it's more convenient and the related code can be compiled
out when not enabled.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ea1c472d 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq

kernfs_node->u.completion is used to notify deactivation completion
from kernfs_put_active() to kernfs_deactivate(). We now allow
multiple racing removals of the same node and the current removal
scheme is no longer correct - kernfs_remove() invocation may return
before the node is properly deactivated if it races against another
removal. The removal path will be restructured to address the issue.

To help such restructure which requires supporting multiple waiters,
this patch replaces kernfs_node->u.completion with
kernfs_root->deactivate_waitq. This makes deactivation event
notifications share a per-root waitqueue_head; however, the wait path
is quite cold and this will also allow shaving one pointer off
kernfs_node.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 80b9bbef 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: add kernfs_dir_ops

Add support for mkdir(2), rmdir(2) and rename(2) syscalls. This is
implemented through optional kernfs_dir_ops callback table which can
be specified on kernfs_create_root(). An implemented callback is
invoked when the matching syscall is invoked.

As kernfs keep dcache syncs with internal representation and
revalidates dentries on each access, the implementation of these
methods is extremely simple. Each just discovers the relevant
kernfs_node(s) and invokes the requested callback which is allowed to
do any kernfs operations and the end result doesn't necessarily have
to match the expected semantics of the syscall.

This will be used to convert cgroup to use kernfs instead of its own
filesystem implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 19bbb926 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: allow negative dentries

kernfs doesn't allow negative dentries - kernfs_iop_lookup() returns
ERR_PTR(-ENOENT) instead of NULL which short-circuits negative dentry
creation and kernfs's d_delete() callback, kernfs_dop_delete(),
returns 1 for all removed nodes. This in turn allows
kernfs_dop_revalidate() to assume that there's no negative dentry for
kernfs.

This worked fine for sysfs but kernfs is scheduled to grow mkdir(2)
support which depend on negative dentries. This patch updates so that
kernfs allows negative dentries. The required changes are almost
trivial - kernfs_iop_lookup() now returns NULL instead of
ERR_PTR(-ENOENT) when the target kernfs_node doesn't exist,
kernfs_dop_delete() is removed and kernfs_dop_revalidate() is updated
to check whether the target dentry is negative and request fresh
lookup if so.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 47a52e91 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: update kernfs_rename_ns() to consider KERNFS_STATIC_NAME

kernfs_rename_ns() currently assumes that the target sysfs_dirent has
a copied name. This has been okay because sysfs supports rename only
for directories which always have copied names; however, there's
nothing in kernfs interface which calls for such restriction and
currently invoking kernfs_rename_ns() on a regular file leads to oops
because it ends up trying to kfree() a static name.

This patch updates kernfs_rename_ns() so that it skips kfree() of the
old name if it's static. This allows it to be used for all node
types.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 2063d608 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: mark static names with KERNFS_STATIC_NAME

Because sysfs used struct attribute which are supposed to stay
constant, sysfs didn't copy names when creating regular files. The
specified string for name was supposed to stay constant. Such
distinction isn't inherent for kernfs. kernfs_create_file[_ns]()
should be able to take the same @name as kernfs_create_dir[_ns]()

As there can be huge number of sysfs attributes, we still want to be
able to use static names for sysfs attributes. This patch renames
kernfs_create_file_ns_key() to __kernfs_create_file() and adds
@name_is_static parameter so that the caller can explicitly indicate
that @name can be used without copying. kernfs is updated to use
KERNFS_STATIC_NAME to distinguish static and copied names.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# d0ae3d43 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: add REMOVED check to create and rename paths

kernfs currently assumes that the caller doesn't try to create a new
node under a removed parent, rename a removed node, or move a node
under a removed node. While this works fine for sysfs, it'd be nice
to have protection against such cases especially given that kernfs is
planned to add support for mkdir, rmdir and rename requsts from
userland which may make race conditions more likely.

This patch updates create and rename paths to check REMOVED and fail
the operation with -ENOENT if performed on or towards removed nodes.
Note that remove path already has such check.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# bb8b9d09 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: add @mode to kernfs_create_dir[_ns]()

sysfs assumed 0755 for all newly created directories and kernfs
inherited it. This assumption is unnecessarily restrictive and
inconsistent with kernfs_create_file[_ns](). This patch adds @mode
parameter to kernfs_create_dir[_ns]() and update uses in sysfs
accordingly. Among others, this will be useful for implementations of
the planned ->mkdir() method.

This patch doesn't introduce any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# c637b8ac 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: s/sysfs/kernfs/ in internal functions and whatever is left

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

This patch performs the following renames.

* s/sysfs_*()/kernfs_*()/ in all internal functions
* s/sysfs/kernfs/ in internal strings, comments and whatever is remaining
* Uniformly rename various vfs operations so that they're consistently
named and distinguishable.

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# a797bfc3 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: s/sysfs/kernfs/ in global variables

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

This patch performs the following renames.

* s/sysfs_mutex/kernfs_mutex/
* s/sysfs_dentry_ops/kernfs_dops/
* s/sysfs_dir_operations/kernfs_dir_fops/
* s/sysfs_dir_inode_operations/kernfs_dir_iops/
* s/kernfs_file_operations/kernfs_file_fops/ - renamed for consistency
* s/sysfs_symlink_inode_operations/kernfs_symlink_iops/
* s/sysfs_aops/kernfs_aops/
* s/sysfs_backing_dev_info/kernfs_bdi/
* s/sysfs_inode_operations/kernfs_iops/
* s/sysfs_dir_cachep/kernfs_node_cache/
* s/sysfs_ops/kernfs_sops/

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# df23fc39 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: s/sysfs/kernfs/ in constants

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

This patch performs the following renames.

* s/SYSFS_DIR/KERNFS_DIR/
* s/SYSFS_KOBJ_ATTR/KERNFS_FILE/
* s/SYSFS_KOBJ_LINK/KERNFS_LINK/
* s/SYSFS_{TYPE_FLAGS}/KERNFS_{TYPE_FLAGS}/
* s/SYSFS_FLAG_{FLAG}/KERNFS_{FLAG}/
* s/sysfs_type()/kernfs_type()/
* s/SD_DEACTIVATED_BIAS/KN_DEACTIVATED_BIAS/

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# c525aadd 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: s/sysfs/kernfs/ in various data structures

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

This patch performs the following renames.

* s/sysfs_open_dirent/kernfs_open_node/
* s/sysfs_open_file/kernfs_open_file/
* s/sysfs_inode_attrs/kernfs_iattrs/
* s/sysfs_addrm_cxt/kernfs_addrm_cxt/
* s/sysfs_super_info/kernfs_super_info/
* s/sysfs_info()/kernfs_info()/
* s/sysfs_open_dirent_lock/kernfs_open_node_lock/
* s/sysfs_open_file_mutex/kernfs_open_file_mutex/
* s/sysfs_of()/kernfs_of()/

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# adc5e8b5 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: drop s_ prefix from kernfs_node members

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

s_ prefix for kernfs members is used inconsistently and a misnomer
now. It's not like kernfs_node is used widely across the kernel
making the ability to grep for the members particularly useful. Let's
just drop the prefix.

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 324a56e1 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: s/sysfs_dirent/kernfs_node/ and rename its friends accordingly

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

This patch performs the following renames.

* s/sysfs_elem_dir/kernfs_elem_dir/
* s/sysfs_elem_symlink/kernfs_elem_symlink/
* s/sysfs_elem_attr/kernfs_elem_file/
* s/sysfs_dirent/kernfs_node/
* s/sd/kn/ in kernfs proper
* s/parent_sd/parent/
* s/target_sd/target/
* s/dir_sd/parent/
* s/to_sysfs_dirent()/rb_to_kn()/
* misc renames of local vars when they conflict with the above

Because md, mic and gpio dig into sysfs details, this patch ends up
modifying them. All are sysfs_dirent renames and trivial. While we
can avoid these by introducing a dummy wrapping struct sysfs_dirent
around kernfs_node, given the limited usage outside kernfs and sysfs
proper, I don't think such workaround is called for.

This patch is strictly rename only and doesn't introduce any
functional difference.

- mic / gpio renames were missing. Spotted by kbuild test robot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 2322392b 23-Nov-2013 Tejun Heo <tj@kernel.org>

kernfs: implement "trusted.*" xattr support

kernfs inherited "security.*" xattr support from sysfs. This patch
extends xattr support to "trusted.*" using simple_xattr_*(). As
trusted xattrs are restricted to CAP_SYS_ADMIN, simple_xattr_*() which
uses kernel memory for storage shouldn't be problematic.

Note that the existing "security.*" support doesn't implement
get/remove/list and the this patch only implements those ops for
"trusted.*". We probably want to extend those ops to include support
for "security.*".

This patch will allow using kernfs from cgroup which requires
"trusted.*" xattr support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David P. Quigley <dpquigl@tycho.nsa.gov>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ac9bba03 29-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: implement kernfs_ns_enabled()

fs/sysfs/symlink.c::sysfs_delete_link() tests @sd->s_flags for
SYSFS_FLAG_NS. Let's add kernfs_ns_enabled() so that sysfs doesn't
have to test sysfs_dirent flag directly. This makes things tidier for
kernfs proper too.

This is purely cosmetic.

v2: To avoid possible NULL deref, use noop dummy implementation which
always returns false when !CONFIG_SYSFS.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# cf9e5a73 29-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: make sysfs_dirent definition public

sysfs_dirent includes some information which should be available to
kernfs users - the type, flags, name and parent pointer. This patch
moves sysfs_dirent definition from kernfs/kernfs-internal.h to
include/linux/kernfs.h so that kernfs users can access them.

The type part of flags is exported as enum kernfs_node_type, the flags
kernfs_node_flag, sysfs_type() and kernfs_enable_ns() are moved to
include/linux/kernfs.h and the former is updated to return the enum
type. sysfs_dirent->s_parent and ->s_name are marked explicitly as
public.

This patch doesn't introduce any functional changes.

v2: Flags exported too and kernfs_enable_ns() definition moved.

v3: While moving kernfs_enable_ns() to include/linux/kernfs.h, v1 and
v2 put the definition outside CONFIG_SYSFS replacing the dummy
implementation with the actual implementation too. Unfortunately,
this can lead to oops when !CONFIG_SYSFS because
kernfs_enable_ns() may be called on a NULL @sd and now tries to
dereference @sd instead of not doing anything. This issue was
reported by Yuanhan Liu.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# bc755553 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: make inode number ida per kernfs_root

kernfs is being updated to allow multiple sysfs_dirent hierarchies so
that it can also be used by other users. Currently, inode number is
allocated using a global ida, sysfs_ino_ida; however, inos for
different hierarchies should be handled separately.

This patch makes ino allocation per kernfs_root. sysfs_ino_ida is
replaced by kernfs_root->ino_ida and sysfs_new_dirent() is updated to
take @root and allocate ino from it. ida_simple_get/remove() are used
instead of sysfs_ino_lock and sysfs_alloc/free_ino().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ba7443bc 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: implement kernfs_create/destroy_root()

There currently is single kernfs hierarchy in the whole system which
is used for sysfs. kernfs needs to support multiple hierarchies to
allow other users. This patch introduces struct kernfs_root which
serves as the root of each kernfs hierarchy and implements
kernfs_create/destroy_root().

* Each kernfs_root is associated with a root sd (sysfs_dentry). The
root is freed when the root sd is released and kernfs_destory_root()
simply invokes kernfs_remove() on the root sd. sysfs_remove_one()
is updated to handle release of the root sd. Note that ps_iattr
update in sysfs_remove_one() is trivially updated for readability.

* Root sd's are now dynamically allocated using sysfs_new_dirent().
Update sysfs_alloc_ino() so that it gives out ino from 1 so that the
root sd still gets ino 1.

* While kernfs currently only points to the root sd, it'll soon grow
fields which are specific to each hierarchy. As determining a given
sd's root will be necessary, sd->s_dir.root is added. This backlink
fits better as a separate field in sd; however, sd->s_dir is inside
union with space to spare, so use it to save space and provide
kernfs_root() accessor to determine the root sd.

* As hierarchies may be destroyed now, each mount needs to hold onto
the hierarchy it's attached to. Update sysfs_fill_super() and
sysfs_kill_sb() so that they get and put the kernfs_root
respectively.

* sysfs_root is replaced with kernfs_root which is dynamically created
by invoking kernfs_create_root() from sysfs_init().

This patch doesn't introduce any visible behavior changes.

v2: kernfs_create_root() forgot to set @sd->priv. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# fd7b9f7b 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: move dir core code to fs/kernfs/dir.c

Move core dir code to fs/kernfs/dir.c. fs/sysfs/dir.c now only
contains sysfs_warn_dup() and sysfs wrappers around kernfs interfaces.
The respective declarations in fs/sysfs/sysfs.h are moved to
fs/kernfs/kernfs-internal.h.

This is pure relocation.

v2: sysfs_symlink_target_lock was mistakenly relocated to kernfs. It
should remain with sysfs. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# b8441ed2 24-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: add skeletons for kernfs

Core sysfs implementation will be separated into kernfs so that it can
be used by other non-kobject users.

This patch creates fs/kernfs/ directory and makes boilerplate changes.
kernfs interface will be directly based on sysfs_dirent and its
forward declaration is moved to include/linux/kernfs.h which is
included from include/linux/sysfs.h. sysfs core implementation will
be gradually separated out and moved to kernfs.

This patch doesn't introduce any functional changes.

v2: mount.c added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>