History log of /linux-master/include/linux/kernfs.h
Revision Date Author Comments
# 4207b556 09-Jan-2024 Tejun Heo <tj@kernel.org>

kernfs: RCU protect kernfs_nodes and avoid kernfs_idr_lock in kernfs_find_and_get_node_by_id()

The BPF helper bpf_cgroup_from_id() calls kernfs_find_and_get_node_by_id()
which acquires kernfs_idr_lock, which is an non-raw non-IRQ-safe lock. This
can lead to deadlocks as bpf_cgroup_from_id() can be called from any BPF
programs including e.g. the ones that attach to functions which are holding
the scheduler rq lock.

Consider the following BPF program:

SEC("fentry/__set_cpus_allowed_ptr_locked")
int BPF_PROG(__set_cpus_allowed_ptr_locked, struct task_struct *p,
struct affinity_context *affn_ctx, struct rq *rq, struct rq_flags *rf)
{
struct cgroup *cgrp = bpf_cgroup_from_id(p->cgroups->dfl_cgrp->kn->id);

if (cgrp) {
bpf_printk("%d[%s] in %s", p->pid, p->comm, cgrp->kn->name);
bpf_cgroup_release(cgrp);
}
return 0;
}

__set_cpus_allowed_ptr_locked() is called with rq lock held and the above
BPF program calls bpf_cgroup_from_id() within leading to the following
lockdep warning:

=====================================================
WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
6.7.0-rc3-work-00053-g07124366a1d7-dirty #147 Not tainted
-----------------------------------------------------
repro/1620 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
ffffffff833b3688 (kernfs_idr_lock){+.+.}-{2:2}, at: kernfs_find_and_get_node_by_id+0x1e/0x70

and this task is already holding:
ffff888237ced698 (&rq->__lock){-.-.}-{2:2}, at: task_rq_lock+0x4e/0xf0
which would create a new lock dependency:
(&rq->__lock){-.-.}-{2:2} -> (kernfs_idr_lock){+.+.}-{2:2}
...
Possible interrupt unsafe locking scenario:

CPU0 CPU1
---- ----
lock(kernfs_idr_lock);
local_irq_disable();
lock(&rq->__lock);
lock(kernfs_idr_lock);
<Interrupt>
lock(&rq->__lock);

*** DEADLOCK ***
...
Call Trace:
dump_stack_lvl+0x55/0x70
dump_stack+0x10/0x20
__lock_acquire+0x781/0x2a40
lock_acquire+0xbf/0x1f0
_raw_spin_lock+0x2f/0x40
kernfs_find_and_get_node_by_id+0x1e/0x70
cgroup_get_from_id+0x21/0x240
bpf_cgroup_from_id+0xe/0x20
bpf_prog_98652316e9337a5a___set_cpus_allowed_ptr_locked+0x96/0x11a
bpf_trampoline_6442545632+0x4f/0x1000
__set_cpus_allowed_ptr_locked+0x5/0x5a0
sched_setaffinity+0x1b3/0x290
__x64_sys_sched_setaffinity+0x4f/0x60
do_syscall_64+0x40/0xe0
entry_SYSCALL_64_after_hwframe+0x46/0x4e

Let's fix it by protecting kernfs_node and kernfs_root with RCU and making
kernfs_find_and_get_node_by_id() acquire rcu_read_lock() instead of
kernfs_idr_lock.

This adds an rcu_head to kernfs_node making it larger by 16 bytes on 64bit.
Combined with the preceding rearrange patch, the net increase is 8 bytes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <andrea.righi@canonical.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/20240109214828.252092-4-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 1c9f2c76 10-Jan-2024 Tejun Heo <tj@kernel.org>

kernfs: Rearrange kernfs_node fields to reduce its size on 64bit

Moving .flags and .mode right below .hash makes kernfs_node smaller by 8
bytes on 64bit. To avoid creating a hole from 8 bytes alignment on 32bit
archs, .priv is moved below so that there are two 32bit pointers after the
64bit .id field.

v2: Updated to avoid size increase on 32bit noticed by Geert.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/ZZ7hwA18nfmFjYpj@slm.duckdns.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 0fedefd4 25-Sep-2023 Valentine Sinitsyn <valesini@yandex-team.ru>

kernfs: sysfs: support custom llseek method for sysfs entries

As of now, seeking in sysfs files is handled by generic_file_llseek().
There are situations where one may want to customize seeking logic:

- Many sysfs entries are fixed files while generic_file_llseek() accepts
past-the-end positions. Not only being useless by itself, this
also means a bug in userspace code will trigger not at lseek(), but at
some later point making debugging harder.
- generic_file_llseek() relies on f_mapping->host to get the file size
which might not be correct for all sysfs entries.
See commit 636b21b50152 ("PCI: Revoke mappings like devmem") as an example.

Implement llseek method to override this behavior at sysfs attribute
level. The method is optional, and if it is absent,
generic_file_llseek() is called to preserve backwards compatibility.

Signed-off-by: Valentine Sinitsyn <valesini@yandex-team.ru>
Link: https://lore.kernel.org/r/20230925084013.309399-1-valesini@yandex-team.ru
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 79038a99 24-Jul-2023 Arnd Bergmann <arnd@arndb.de>

kernfs: add stub helper for kernfs_generic_poll()

In some randconfig builds, kernfs ends up being disabled, so there is no prototype
for kernfs_generic_poll()

In file included from kernel/sched/build_utility.c:97:
kernel/sched/psi.c:1479:3: error: implicit declaration of function 'kernfs_generic_poll' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
kernfs_generic_poll(t->of, wait);
^

Add a stub helper for it, as we have it for other kernfs functions.

Fixes: aff037078ecae ("sched/psi: use kernfs polling functions for PSI trigger polling")
Fixes: 147e1a97c4a0b ("fs: kernfs: add poll file operation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20230724121823.1357562-1-arnd@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 783bd07d 27-Aug-2022 Tejun Heo <tj@kernel.org>

kernfs: Implement kernfs_show()

Currently, kernfs nodes can be created hidden and activated later by calling
kernfs_activate() to allow creation of multiple nodes to succeed or fail as
a unit. This is an one-way one-time-only transition. This patch introduces
kernfs_show() which can toggle visibility dynamically.

As the currently proposed use - toggling the cgroup pressure files - only
requires operating on leaf nodes, for the sake of simplicity, restrict it as
such for now.

Hiding uses the same mechanism as deactivation and likewise guarantees that
there are no in-flight operations on completion. KERNFS_ACTIVATED and
KERNFS_HIDDEN are used to manage the interactions between activations and
show/hide operations. A node is visible iff both activated & !hidden.

Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220828050440.734579-9-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# c2549174 27-Aug-2022 Tejun Heo <tj@kernel.org>

kernfs: Add KERNFS_REMOVING flags

KERNFS_ACTIVATED tracks whether a given node has ever been activated. As a
node was only deactivated on removal, this was used for

1. Drain optimization (removed by the previous patch).
2. To hide !activated nodes
3. To avoid double activations
4. Reject adding children to a node being removed
5. Skip activaing a node which is being removed.

We want to decouple deactivation from removal so that nodes can be
deactivated and hidden dynamically, which makes KERNFS_ACTIVATED useless for
all of the above purposes.

#1 is already gone. #2 and #3 can instead test whether the node is currently
active. A new flag KERNFS_REMOVING is added to explicitly mark nodes which
are being removed for #4 and #5.

While this leaves KERNFS_ACTIVATED with no users, leave it be as it will be
used in a following patch.

Cc: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220828050440.734579-7-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 2fd26970 05-Jul-2022 Imran Khan <imran.f.khan@oracle.com>

Revert "kernfs: Change kernfs_notify_list to llist."

This reverts commit b8f35fa1188b84035c59d4842826c4e93a1b1c9f.

This is causing regression due to same kernfs_node getting
added multiple times in kernfs_notify_list so revert it until
safe way of using llist in this context is found.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: Michael Walle <michael@walle.cc>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220705201026.2487665-1-imran.f.khan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 1d25b84e 14-Jun-2022 Imran Khan <imran.f.khan@oracle.com>

kernfs: Replace global kernfs_open_file_mutex with hashed mutexes.

In current kernfs design a single mutex, kernfs_open_file_mutex, protects
the list of kernfs_open_file instances corresponding to a sysfs attribute.
So even if different tasks are opening or closing different sysfs files
they can contend on osq_lock of this mutex. The contention is more apparent
in large scale systems with few hundred CPUs where most of the CPUs have
running tasks that are opening, accessing or closing sysfs files at any
point of time.

Using hashed mutexes in place of a single global mutex, can significantly
reduce contention around global mutex and hence can provide better
scalability. Moreover as these hashed mutexes are not part of kernfs_node
objects we will not see any singnificant change in memory utilization of
kernfs based file systems like sysfs, cgroupfs etc.

Modify interface introduced in previous patch to make use of hashed
mutexes. Use kernfs_node address as hashing key.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Link: https://lore.kernel.org/r/20220615021059.862643-5-imran.f.khan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# b8f35fa1 14-Jun-2022 Imran Khan <imran.f.khan@oracle.com>

kernfs: Change kernfs_notify_list to llist.

At present kernfs_notify_list is implemented as a singly linked
list of kernfs_node(s), where last element points to itself and
value of ->attr.next tells if node is present on the list or not.
Both addition and deletion to list happen under kernfs_notify_lock.

Change kernfs_notify_list to llist so that addition to list can heppen
locklessly.

Suggested by: Al Viro <viro@zeniv.linux.org.uk>

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Link: https://lore.kernel.org/r/20220615021059.862643-3-imran.f.khan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 086c00c7 14-Jun-2022 Imran Khan <imran.f.khan@oracle.com>

kernfs: make ->attr.open RCU protected.

After removal of kernfs_open_node->refcnt in the previous patch,
kernfs_open_node_lock can be removed as well by making ->attr.open
RCU protected. kernfs_put_open_node can delegate freeing to ->attr.open
to RCU and other readers of ->attr.open can do so under rcu_read_(un)lock.

Suggested by: Al Viro <viro@zeniv.linux.org.uk>

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Link: https://lore.kernel.org/r/20220615021059.862643-2-imran.f.khan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 7a19006b 18-Mar-2022 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

kernfs: remove unneeded #if 0 guard

Commit f2eb478f2f32 ("kernfs: move struct kernfs_root out of the public
view.") moved kernfs_root out of kernfs.h, but my debugging code of a
#if 0 was left in accidentally. Fix that up by removing the guards.

Fixes: f2eb478f2f32 ("kernfs: move struct kernfs_root out of the public view.")
Cc: Tejun Heo <tj@kernel.org>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20220318073452.1486568-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# f2eb478f 22-Feb-2022 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

kernfs: move struct kernfs_root out of the public view.

There is no need to have struct kernfs_root be part of kernfs.h for
the whole kernel to see and poke around it. Move it internal to kernfs
code and provide a helper function, kernfs_root_to_node(), to handle the
one field that kernfs users were directly accessing from the structure.

Cc: Imran Khan <imran.f.khan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220222070713.3517679-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 79f1c730 09-Dec-2021 Andy Shevchenko <andriy.shevchenko@linux.intel.com>

kernfs: Replace kernel.h with the necessary inclusions

When kernel.h is used in the headers it adds a lot into dependency hell,
especially when there are circular dependencies are involved.

Replace kernel.h inclusion with the list of what is really being used.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20211209123008.3391-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 393c3714 18-Nov-2021 Minchan Kim <minchan@kernel.org>

kernfs: switch global kernfs_rwsem lock to per-fs lock

The kernfs implementation has big lock granularity(kernfs_rwsem) so
every kernfs-based(e.g., sysfs, cgroup) fs are able to compete the
lock. It makes trouble for some cases to wait the global lock
for a long time even though they are totally independent contexts
each other.

A general example is process A goes under direct reclaim with holding
the lock when it accessed the file in sysfs and process B is waiting
the lock with exclusive mode and then process C is waiting the lock
until process B could finish the job after it gets the lock from
process A.

This patch switches the global kernfs_rwsem to per-fs lock, which
put the rwsem into kernfs_root.

Suggested-by: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Link: https://lore.kernel.org/r/20211118230008.2679780-1-minchan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# eaf501e0 12-Sep-2021 Christoph Hellwig <hch@lst.de>

kernfs: remove the unused lockdep_key field in struct kernfs_ops

Not actually used anywhere.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210913054121.616001-4-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 29356624 12-Sep-2021 Christoph Hellwig <hch@lst.de>

kernfs: remove kernfs_create_file and kernfs_create_file_ns

All callers actually use __kernfs_create_file.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210913054121.616001-3-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 7ba0273b 16-Jul-2021 Ian Kent <raven@themaw.net>

kernfs: switch kernfs to use an rwsem

The kernfs global lock restricts the ability to perform kernfs node
lookup operations in parallel during path walks.

Change the kernfs mutex to an rwsem so that, when opportunity arises,
node searches can be done in parallel with path walk lookups.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/162642770946.63632.2218304587223241374.stgit@web.messagingengine.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 895adbec 16-Jul-2021 Ian Kent <raven@themaw.net>

kernfs: add a revision to identify directory node changes

Add a revision counter to kernfs directory nodes so it can be used
to detect if a directory node has changed during negative dentry
revalidation.

There's an assumption that sizeof(unsigned long) <= sizeof(pointer)
on all architectures and as far as I know that assumption holds.

So adding a revision counter to the struct kernfs_elem_dir variant of
the kernfs_node type union won't increase the size of the kernfs_node
struct. This is because struct kernfs_elem_dir is at least
sizeof(pointer) smaller than the largest union variant. It's tempting
to make the revision counter a u64 but that would increase the size of
kernfs_node on archs where sizeof(pointer) is smaller than the revision
counter.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/162642769895.63632.8356662784964509867.stgit@web.messagingengine.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 21774fd8 15-Oct-2020 Willem de Bruijn <willemb@google.com>

kernfs: bring names in comments in line with code

Fix two stragglers in the comments of the below rename operation.

Fixes: adc5e8b58f48 ("kernfs: drop s_ prefix from kernfs_node members")
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20201015185726.1386868-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 0c47383b 12-Mar-2020 Daniel Xu <dxu@dxuuu.xyz>

kernfs: Add option to enable user xattrs

User extended attributes are useful as metadata storage for kernfs
consumers like cgroups. Especially in the case of cgroups, it is useful
to have a central metadata store that multiple processes/services can
use to coordinate actions.

A concrete example is for userspace out of memory killers. We want to
let delegated cgroup subtree owners (running as non-root) to be able to
say "please avoid killing this cgroup". This is especially important for
desktop linux as delegated subtrees owners are less likely to run as
root.

This patch introduces a new flag, KERNFS_ROOT_SUPPORT_USER_XATTR, that
lets kernfs consumers enable user xattr support. An initial limit of 128
entries or 128KB -- whichever is hit first -- is placed per cgroup
because xattrs come from kernel memory and we don't want to let
unprivileged users accidentally eat up too much kernel memory.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 40430452 04-Nov-2019 Tejun Heo <tj@kernel.org>

kernfs: use 64bit inos if ino_t is 64bit

Each kernfs_node is identified with a 64bit ID. The low 32bit is
exposed as ino and the high gen. While this already allows using inos
as keys by looking up with wildcard generation number of 0, it's
adding unnecessary complications for 64bit ino archs which can
directly use kernfs_node IDs as inos to uniquely identify each cgroup
instance.

This patch exposes IDs directly as inos on 64bit ino archs. The
conversion is mostly straight-forward.

* 32bit ino archs behave the same as before. 64bit ino archs now use
the whole 64bit ID as ino and the generation number is fixed at 1.

* 64bit inos still use the same idr allocator which gurantees that the
lower 32bits identify the current live instance uniquely and the
high 32bits are incremented whenever the low bits wrap. As the
upper 32bits are no longer used as gen and we don't wanna start ino
allocation with 33rd bit set, the initial value for highbits
allocation is changed to 0 on 64bit ino archs.

* blktrace exposes two 32bit numbers - (INO,GEN) pair - to identify
the issuing cgroup. Userland builds FILEID_INO32_GEN fids from
these numbers to look up the cgroups. To remain compatible with the
behavior, always output (LOW32,HIGH32) which will be constructed
back to the original 64bit ID by __kernfs_fh_to_dentry().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>


# fe0f726c 04-Nov-2019 Tejun Heo <tj@kernel.org>

kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()

kernfs_find_and_get_node_by_ino() looks the kernfs_node matching the
specified ino. On top of that, kernfs_get_node_by_id() and
kernfs_fh_get_inode() implement full ID matching by testing the rest
of ID.

On surface, confusingly, the two are slightly different in that the
latter uses 0 gen as wildcard while the former doesn't - does it mean
that the latter can't uniquely identify inodes w/ 0 gen? In practice,
this is a distinction without a difference because generation number
starts at 1. There are no actual IDs with 0 gen, so it can always
safely used as wildcard.

Let's simplify the code by renaming kernfs_find_and_get_node_by_ino()
to kernfs_find_and_get_node_by_id(), moving all lookup logics into it,
and removing now unnecessary kernfs_get_node_by_id().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 67c0496e 04-Nov-2019 Tejun Heo <tj@kernel.org>

kernfs: convert kernfs_node->id from union kernfs_node_id to u64

kernfs_node->id is currently a union kernfs_node_id which represents
either a 32bit (ino, gen) pair or u64 value. I can't see much value
in the usage of the union - all that's needed is a 64bit ID which the
current code is already limited to. Using a union makes the code
unnecessarily complicated and prevents using 64bit ino without adding
practical benefits.

This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
ino is stored in the lower 32bits and gen upper. Accessors -
kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the
ino and gen. This simplifies ID handling less cumbersome and will
allow using 64bit inos on supported archs.

This patch doesn't make any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexei Starovoitov <ast@kernel.org>


# e23f568a 04-Nov-2019 Tejun Heo <tj@kernel.org>

kernfs: fix ino wrap-around detection

When the 32bit ino wraps around, kernfs increments the generation
number to distinguish reused ino instances. The wrap-around detection
tests whether the allocated ino is lower than what the cursor but the
cursor is pointing to the next ino to allocate so the condition never
triggers.

Fix it by remembering the last ino and comparing against that.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 4a3ef68acacf ("kernfs: implement i_generation")
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: stable@vger.kernel.org # v4.14+


# 55716d26 01-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428

Based on 1 normalized pattern(s):

this file is released under the gplv2

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 68 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 0d1a393d 02-Apr-2019 Christina Quast <cquast@hanoverdisplays.com>

fs: kernfs: Corrected spelling mistake

flies => files

Signed-off-by: Christina Quast <cquast@hanoverdisplays.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 1537ad15 03-Apr-2019 Ondrej Mosnacek <omosnace@redhat.com>

kernfs: fix xattr name handling in LSM helpers

The implementation of kernfs_security_xattr_*() helpers reuses the
kernfs_node_xattr_*() functions, which take the suffix of the xattr name
and extract full xattr name from it using xattr_full_name(). However,
this function relies on the fact that the suffix passed to xattr
handlers from VFS is always constructed from the full name by just
incerementing the pointer. This doesn't necessarily hold for the callers
of kernfs_security_xattr_*(), so their usage will easily lead to
out-of-bounds access.

Fix this by moving the xattr name reconstruction to the VFS xattr
handlers and replacing the kernfs_security_xattr_*() helpers with more
general kernfs_xattr_*() helpers that take full xattr name and allow
accessing all kernfs node's xattrs.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Fixes: b230d5aba2d1 ("LSM: add new hook for kernfs node initialization")
Fixes: ec882da5cda9 ("selinux: implement the kernfs_init_security hook")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>


# b230d5ab 22-Feb-2019 Ondrej Mosnacek <omosnace@redhat.com>

LSM: add new hook for kernfs node initialization

This patch introduces a new security hook that is intended for
initializing the security data for newly created kernfs nodes, which
provide a way of storing a non-default security context, but need to
operate independently from mounts (and therefore may not have an
associated inode at the moment of creation).

The main motivation is to allow kernfs nodes to inherit the context of
the parent under SELinux, similar to the behavior of
security_inode_init_security(). Other LSMs may implement their own logic
for handling the creation of new nodes.

This patch also adds helper functions to <linux/kernfs.h> for
getting/setting security xattrs of a kernfs node so that LSMs hooks are
able to do their job. Other important attributes should be accessible
direcly in the kernfs_node fields (in case there is need for more, then
new helpers should be added to kernfs.h along with the patch that needs
them).

Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: more manual merge fixes]
Signed-off-by: Paul Moore <paul@paul-moore.com>


# 147e1a97 05-Mar-2019 Johannes Weiner <hannes@cmpxchg.org>

fs: kernfs: add poll file operation

Patch series "psi: pressure stall monitors", v3.

Android is adopting psi to detect and remedy memory pressure that
results in stuttering and decreased responsiveness on mobile devices.

Psi gives us the stall information, but because we're dealing with
latencies in the millisecond range, periodically reading the pressure
files to detect stalls in a timely fashion is not feasible. Psi also
doesn't aggregate its averages at a high enough frequency right now.

This patch series extends the psi interface such that users can
configure sensitive latency thresholds and use poll() and friends to be
notified when these are breached.

As high-frequency aggregation is costly, it implements an aggregation
method that is optimized for fast, short-interval averaging, and makes
the aggregation frequency adaptive, such that high-frequency updates
only happen while monitored stall events are actively occurring.

With these patches applied, Android can monitor for, and ward off,
mounting memory shortages before they cause problems for the user. For
example, using memory stall monitors in userspace low memory killer
daemon (lmkd) we can detect mounting pressure and kill less important
processes before device becomes visibly sluggish.

In our memory stress testing psi memory monitors produce roughly 10x
less false positives compared to vmpressure signals. Having ability to
specify multiple triggers for the same psi metric allows other parts of
Android framework to monitor memory state of the device and act
accordingly.

The new interface is straightforward. The user opens one of the
pressure files for writing and writes a trigger description into the
file descriptor that defines the stall state - some or full, and the
maximum stall time over a given window of time. E.g.:

/* Signal when stall time exceeds 100ms of a 1s window */
char trigger[] = "full 100000 1000000";
fd = open("/proc/pressure/memory");
write(fd, trigger, sizeof(trigger));
while (poll() >= 0) {
...
}
close(fd);

When the monitored stall state is entered, psi adapts its aggregation
frequency according to what the configured time window requires in order
to emit event signals in a timely fashion. Once the stalling subsides,
aggregation reverts back to normal.

The trigger is associated with the open file descriptor. To stop
monitoring, the user only needs to close the file descriptor and the
trigger is discarded.

Patches 1-4 prepare the psi code for polling support. Patch 5
implements the adaptive polling logic, the pressure growth detection
optimized for short intervals, and hooks up write() and poll() on the
pressure files.

The patches were developed in collaboration with Johannes Weiner.

This patch (of 5):

Kernfs has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes. To allow polling for
custom events, add a .poll callback that can override the default.

This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.

Link: http://lkml.kernel.org/r/20190124211518.244221-2-surenb@google.com
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 23bf1b6b 01-Nov-2018 David Howells <dhowells@redhat.com>

kernfs, sysfs, cgroup, intel_rdt: Support fs_context

Make kernfs support superblock creation/mount/remount with fs_context.

This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.

Notes:

(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).

(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired

(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.

(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.

(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.

(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.

(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.

Weirdies:

(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.

(*) The cgroup refcount web. This really needs documenting.

(*) cgroup2 only has one root?

Add a suggestion from Thomas Gleixner in which the RDT enablement code is
placed into its own function.

[folded a leak fix from Andrey Vagin]

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: cgroups@vger.kernel.org
cc: fenghua.yu@intel.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6d7fbce7 16-Jan-2019 Al Viro <viro@zeniv.linux.org.uk>

kill kernfs_pin_sb()

unused now and impossible to use safely anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8f5be0ec 13-Aug-2018 Konstantin Khlebnikov <koct9i@gmail.com>

kernfs: update comment about kernfs_path() return value

Now it returns the length of the full path or error code.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 3abb1d90f5d9 ("kernfs: make kernfs_path*() behave in the style of strlcpy()")
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 488dee96 20-Jul-2018 Dmitry Torokhov <dmitry.torokhov@gmail.com>

kernfs: allow creating kernfs objects with arbitrary uid/gid

This change allows creating kernfs files and directories with arbitrary
uid/gid instead of always using GLOBAL_ROOT_UID/GID by extending
kernfs_create_dir_ns() and kernfs_create_file_ns() with uid/gid arguments.
The "simple" kernfs_create_file() and kernfs_create_dir() are left alone
and always create objects belonging to the global root.

When creating symlinks ownership (uid/gid) is taken from the target kernfs
object.

Co-Developed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 69fd5c39 12-Jul-2017 Shaohua Li <shli@fb.com>

blktrace: add an option to allow displaying cgroup path

By default we output cgroup id in blktrace. This adds an option to
display cgroup path. Since get cgroup path is a relativly heavy
operation, we don't enable it by default.

with the option enabled, blktrace will output something like this:
dd-1353 [007] d..2 293.015252: 8,0 /test/level D R 24 + 8 [dd]

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# aa818825 12-Jul-2017 Shaohua Li <shli@fb.com>

kernfs: add exportfs operations

Now we have the facilities to implement exportfs operations. The idea is
cgroup can export the fhandle info to userspace, then userspace uses
fhandle to find the cgroup name. Another example is userspace can get
fhandle for a cgroup and BPF uses the fhandle to filter info for the
cgroup.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c53cd490 12-Jul-2017 Shaohua Li <shli@fb.com>

kernfs: introduce kernfs_node_id

inode number and generation can identify a kernfs node. We are going to
export the identification by exportfs operations, so put ino and
generation into a separate structure. It's convenient when later patches
use the identification.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4a3ef68a 12-Jul-2017 Shaohua Li <shli@fb.com>

kernfs: implement i_generation

Set i_generation for kernfs inode. This is required to implement
exportfs operations. The generation is 32-bit, so it's possible the
generation wraps up and we find stale files. To reduce the posssibility,
we don't reuse inode numer immediately. When the inode number allocation
wraps, we increase generation number. In this way generation/inode
number consist of a 64-bit number which is unlikely duplicated. This
does make the idr tree more sparse and waste some memory. Since idr
manages 32-bit keys, idr uses a 6-level radix tree, each level covers 6
bits of the key. In a 100k inode kernfs, the worst case will have around
300k radix tree node. Each node is 576bytes, so the tree will use about
~150M memory. Sounds not too bad, if this really is a problem, we should
find better data structure.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7d35079f 12-Jul-2017 Shaohua Li <shli@fb.com>

kernfs: use idr instead of ida to manage inode number

kernfs uses ida to manage inode number. The problem is we can't get
kernfs_node from inode number with ida. Switching to use idr, next patch
will add an API to get kernfs_node from inode number.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0e67db2f 27-Dec-2016 Tejun Heo <tj@kernel.org>

kernfs: add kernfs_ops->open/release() callbacks

Add ->open/release() methods to kernfs_ops. ->open() is called when
the file is opened and ->release() when the file is either released or
severed. These callbacks can be used, for example, to manage
persistent caching objects over multiple seq_file iterations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>


# a1d82aff 27-Dec-2016 Tejun Heo <tj@kernel.org>

kernfs: make kernfs_open_file->mmapped a bitfield

More kernfs_open_file->mutex synchronized flags are planned to be
added. Convert ->mmapped to a bitfield in preparation.

While at it, make kernfs_fop_mmap() use "true" instead of "1" on
->mmapped.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>


# bb09c863 10-Aug-2016 Tejun Heo <tj@kernel.org>

kernfs: remove kernfs_path_len()

It doesn't have any in-kernel user and the same result can be obtained
from kernfs_path(@kn, NULL, 0). Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>


# 3abb1d90 10-Aug-2016 Tejun Heo <tj@kernel.org>

kernfs: make kernfs_path*() behave in the style of strlcpy()

kernfs_path*() functions always return the length of the full path but
the path content is undefined if the length is larger than the
provided buffer. This makes its behavior different from strlcpy() and
requires error handling in all its users even when they don't care
about truncation. In addition, the implementation can actully be
simplified by making it behave properly in strlcpy() style.

* Update kernfs_path_from_node_locked() to always fill up the buffer
with path. If the buffer is not large enough, the output is
truncated and terminated.

* kernfs_path() no longer needs error handling. Make it a simple
inline wrapper around kernfs_path_from_node().

* sysfs_warn_dup()'s use of kernfs_path() doesn't need error handling.
Updated accordingly.

* cgroup_path()'s use of kernfs_path() updated to retain the old
behavior.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>


# 0e0b2afd 10-Aug-2016 Tejun Heo <tj@kernel.org>

kernfs: add dummy implementation of kernfs_path_from_node()

The dummy version of kernfs_path_from_node() was missing. This
currently doesn't break anything. Let's add it for consistency and to
ease adding wrappers around it.

v2: Removed stray ';' which was causing build failures.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 4f41fc59 09-May-2016 Serge E. Hallyn <serge.hallyn@ubuntu.com>

cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces

Patch summary:

When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.

Short explanation (courtesy of mkerrisk):

If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point. Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.

Long version:

When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b. The root dentry field of the mountinfo
entry will show '/a/b'.

cat > /tmp/do1 << EOF
mount -t cgroup -o freezer freezer /mnt
grep freezer /proc/self/mountinfo
EOF

unshare -Gm bash /tmp/do1
> 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer

The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':

grep freezer /proc/self/cgroup
9:freezer:/

If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:

mount --bind /sys/fs/cgroup/freezer/a/b /mnt
130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer

In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.

Example (by mkerrisk):

First, a little script to save some typing and verbiage:

echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'

Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:

2653
2653 # Our shell
14254 # cat(1)
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer

Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):

Look at the state of the /proc files:

/proc/self/cgroup: 10:freezer:/
mountinfo: / /sys/fs/cgroup/freezer

The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.

However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:

# propagating to other mountns
/proc/self/cgroup: 7:freezer:/
mountinfo: /a/b /mnt/freezer

The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.

With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace. So the same steps as above:

/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: /../.. /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: / /mnt/freezer

cgroup.clone_children freezer.parent_freezing freezer.state tasks
cgroup.procs freezer.self_freezing notify_on_release
3164
2653 # First shell that placed in this cgroup
3164 # Shell started by 'unshare'
14197 # cat(1)

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# e4234a1f 31-Mar-2016 Chris Wilson <chris@chris-wilson.co.uk>

kernfs: Move faulting copy_user operations outside of the mutex

A fault in a user provided buffer may lead anywhere, and lockdep warns
that we have a potential deadlock between the mm->mmap_sem and the
kernfs file mutex:

[ 82.811702] ======================================================
[ 82.811705] [ INFO: possible circular locking dependency detected ]
[ 82.811709] 4.5.0-rc4-gfxbench+ #1 Not tainted
[ 82.811711] -------------------------------------------------------
[ 82.811714] kms_setmode/5859 is trying to acquire lock:
[ 82.811717] (&dev->struct_mutex){+.+.+.}, at: [<ffffffff8150d9c1>] drm_gem_mmap+0x1a1/0x270
[ 82.811731]
but task is already holding lock:
[ 82.811734] (&mm->mmap_sem){++++++}, at: [<ffffffff8117b364>] vm_mmap_pgoff+0x44/0xa0
[ 82.811745]
which lock already depends on the new lock.

[ 82.811749]
the existing dependency chain (in reverse order) is:
[ 82.811752]
-> #3 (&mm->mmap_sem){++++++}:
[ 82.811761] [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[ 82.811766] [<ffffffff8118bc65>] __might_fault+0x75/0xa0
[ 82.811771] [<ffffffff8124da4a>] kernfs_fop_write+0x8a/0x180
[ 82.811787] [<ffffffff811d1023>] __vfs_write+0x23/0xe0
[ 82.811792] [<ffffffff811d1d74>] vfs_write+0xa4/0x190
[ 82.811797] [<ffffffff811d2c14>] SyS_write+0x44/0xb0
[ 82.811801] [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
[ 82.811807]
-> #2 (s_active#6){++++.+}:
[ 82.811814] [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[ 82.811819] [<ffffffff8124c070>] __kernfs_remove+0x210/0x2f0
[ 82.811823] [<ffffffff8124d040>] kernfs_remove_by_name_ns+0x40/0xa0
[ 82.811828] [<ffffffff8124e9e0>] sysfs_remove_file_ns+0x10/0x20
[ 82.811832] [<ffffffff815318d4>] device_del+0x124/0x250
[ 82.811837] [<ffffffff81531a19>] device_unregister+0x19/0x60
[ 82.811841] [<ffffffff8153c051>] cpu_cache_sysfs_exit+0x51/0xb0
[ 82.811846] [<ffffffff8153c628>] cacheinfo_cpu_callback+0x38/0x70
[ 82.811851] [<ffffffff8109ae89>] notifier_call_chain+0x39/0xa0
[ 82.811856] [<ffffffff8109aef9>] __raw_notifier_call_chain+0x9/0x10
[ 82.811860] [<ffffffff810786de>] cpu_notify+0x1e/0x40
[ 82.811865] [<ffffffff81078779>] cpu_notify_nofail+0x9/0x20
[ 82.811869] [<ffffffff81078ac3>] _cpu_down+0x233/0x340
[ 82.811874] [<ffffffff81079019>] disable_nonboot_cpus+0xc9/0x350
[ 82.811878] [<ffffffff810d2e11>] suspend_devices_and_enter+0x5a1/0xb50
[ 82.811883] [<ffffffff810d3903>] pm_suspend+0x543/0x8d0
[ 82.811888] [<ffffffff810d1b77>] state_store+0x77/0xe0
[ 82.811892] [<ffffffff813fa68f>] kobj_attr_store+0xf/0x20
[ 82.811897] [<ffffffff8124e740>] sysfs_kf_write+0x40/0x50
[ 82.811902] [<ffffffff8124dafc>] kernfs_fop_write+0x13c/0x180
[ 82.811906] [<ffffffff811d1023>] __vfs_write+0x23/0xe0
[ 82.811910] [<ffffffff811d1d74>] vfs_write+0xa4/0x190
[ 82.811914] [<ffffffff811d2c14>] SyS_write+0x44/0xb0
[ 82.811918] [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
[ 82.811923]
-> #1 (cpu_hotplug.lock){+.+.+.}:
[ 82.811929] [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[ 82.811933] [<ffffffff817b6f72>] mutex_lock_nested+0x62/0x3b0
[ 82.811940] [<ffffffff810784c1>] get_online_cpus+0x61/0x80
[ 82.811944] [<ffffffff811170eb>] stop_machine+0x1b/0xe0
[ 82.811949] [<ffffffffa0178edd>] gen8_ggtt_insert_entries__BKL+0x2d/0x30 [i915]
[ 82.812009] [<ffffffffa017d3a6>] ggtt_bind_vma+0x46/0x70 [i915]
[ 82.812045] [<ffffffffa017eb70>] i915_vma_bind+0x140/0x290 [i915]
[ 82.812081] [<ffffffffa01862b9>] i915_gem_object_do_pin+0x899/0xb00 [i915]
[ 82.812117] [<ffffffffa0186555>] i915_gem_object_pin+0x35/0x40 [i915]
[ 82.812154] [<ffffffffa019a23e>] intel_init_pipe_control+0xbe/0x210 [i915]
[ 82.812192] [<ffffffffa0197312>] intel_logical_rings_init+0xe2/0xde0 [i915]
[ 82.812232] [<ffffffffa0186fe3>] i915_gem_init+0xf3/0x130 [i915]
[ 82.812278] [<ffffffffa02097ed>] i915_driver_load+0xf2d/0x1770 [i915]
[ 82.812318] [<ffffffff81512474>] drm_dev_register+0xa4/0xb0
[ 82.812323] [<ffffffff8151467e>] drm_get_pci_dev+0xce/0x1e0
[ 82.812328] [<ffffffffa01472cf>] i915_pci_probe+0x2f/0x50 [i915]
[ 82.812360] [<ffffffff8143f907>] pci_device_probe+0x87/0xf0
[ 82.812366] [<ffffffff81535f89>] driver_probe_device+0x229/0x450
[ 82.812371] [<ffffffff81536233>] __driver_attach+0x83/0x90
[ 82.812375] [<ffffffff81533c61>] bus_for_each_dev+0x61/0xa0
[ 82.812380] [<ffffffff81535879>] driver_attach+0x19/0x20
[ 82.812384] [<ffffffff8153535f>] bus_add_driver+0x1ef/0x290
[ 82.812388] [<ffffffff81536e9b>] driver_register+0x5b/0xe0
[ 82.812393] [<ffffffff8143e83b>] __pci_register_driver+0x5b/0x60
[ 82.812398] [<ffffffff81514866>] drm_pci_init+0xd6/0x100
[ 82.812402] [<ffffffffa027c094>] 0xffffffffa027c094
[ 82.812406] [<ffffffff810003de>] do_one_initcall+0xae/0x1d0
[ 82.812412] [<ffffffff811595a0>] do_init_module+0x5b/0x1cb
[ 82.812417] [<ffffffff81106160>] load_module+0x1c20/0x2480
[ 82.812422] [<ffffffff81106bae>] SyS_finit_module+0x7e/0xa0
[ 82.812428] [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
[ 82.812433]
-> #0 (&dev->struct_mutex){+.+.+.}:
[ 82.812439] [<ffffffff810cbe59>] __lock_acquire+0x1fc9/0x20f0
[ 82.812443] [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[ 82.812456] [<ffffffff8150d9e7>] drm_gem_mmap+0x1c7/0x270
[ 82.812460] [<ffffffff81196a14>] mmap_region+0x334/0x580
[ 82.812466] [<ffffffff81196fc4>] do_mmap+0x364/0x410
[ 82.812470] [<ffffffff8117b38d>] vm_mmap_pgoff+0x6d/0xa0
[ 82.812474] [<ffffffff811950f4>] SyS_mmap_pgoff+0x184/0x220
[ 82.812479] [<ffffffff8100a0fd>] SyS_mmap+0x1d/0x20
[ 82.812484] [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
[ 82.812489]
other info that might help us debug this:

[ 82.812493] Chain exists of:
&dev->struct_mutex --> s_active#6 --> &mm->mmap_sem

[ 82.812502] Possible unsafe locking scenario:

[ 82.812506] CPU0 CPU1
[ 82.812508] ---- ----
[ 82.812510] lock(&mm->mmap_sem);
[ 82.812514] lock(s_active#6);
[ 82.812519] lock(&mm->mmap_sem);
[ 82.812522] lock(&dev->struct_mutex);
[ 82.812526]
*** DEADLOCK ***

[ 82.812531] 1 lock held by kms_setmode/5859:
[ 82.812533] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff8117b364>] vm_mmap_pgoff+0x44/0xa0
[ 82.812541]
stack backtrace:
[ 82.812547] CPU: 0 PID: 5859 Comm: kms_setmode Not tainted 4.5.0-rc4-gfxbench+ #1
[ 82.812550] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0040.2015.0814.1353 08/14/2015
[ 82.812553] 0000000000000000 ffff880079407bf0 ffffffff813f8505 ffffffff825fb270
[ 82.812560] ffffffff825c4190 ffff880079407c30 ffffffff810c84ac ffff880079407c90
[ 82.812566] ffff8800797ed328 ffff8800797ecb00 0000000000000001 ffff8800797ed350
[ 82.812573] Call Trace:
[ 82.812578] [<ffffffff813f8505>] dump_stack+0x67/0x92
[ 82.812582] [<ffffffff810c84ac>] print_circular_bug+0x1fc/0x310
[ 82.812586] [<ffffffff810cbe59>] __lock_acquire+0x1fc9/0x20f0
[ 82.812590] [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[ 82.812594] [<ffffffff8150d9c1>] ? drm_gem_mmap+0x1a1/0x270
[ 82.812599] [<ffffffff8150d9e7>] drm_gem_mmap+0x1c7/0x270
[ 82.812603] [<ffffffff8150d9c1>] ? drm_gem_mmap+0x1a1/0x270
[ 82.812608] [<ffffffff81196a14>] mmap_region+0x334/0x580
[ 82.812612] [<ffffffff81196fc4>] do_mmap+0x364/0x410
[ 82.812616] [<ffffffff8117b38d>] vm_mmap_pgoff+0x6d/0xa0
[ 82.812629] [<ffffffff811950f4>] SyS_mmap_pgoff+0x184/0x220
[ 82.812633] [<ffffffff8100a0fd>] SyS_mmap+0x1d/0x20
[ 82.812637] [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73

Highly unlikely though this scenario is, we can avoid the issue entirely
by moving the copy operation from out under the kernfs_get_active()
tracking by assigning the preallocated buffer its own mutex. The
temporary buffer allocation doesn't require mutex locking as it is
entirely local.

The locked section was extended by the addition of the preallocated buf
to speed up md user operations in

commit 2b75869bba676c248d8d25ae6d2bd9221dfffdb6
Author: NeilBrown <neilb@suse.de>
Date: Mon Oct 13 16:41:28 2014 +1100

sysfs/kernfs: allow attributes to request write buffer be pre-allocated.

Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: NeilBrown <neilb@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# fb3c8315 29-Jan-2016 Aditya Kali <adityakali@google.com>

kernfs: define kernfs_node_dentry

Add a new kernfs api is added to lookup the dentry for a particular
kernfs path.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 9f6df573 29-Jan-2016 Aditya Kali <adityakali@google.com>

kernfs: Add API to generate relative kernfs path

The new function kernfs_path_from_node() generates and returns kernfs
path of a given kernfs_node relative to a given parent kernfs_node.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>


# bd96f76a 20-Nov-2015 Tejun Heo <tj@kernel.org>

kernfs: implement kernfs_walk_and_get()

Implement kernfs_walk_and_get() which is similar to
kernfs_find_and_get() but can walk a path instead of just a name.

v2: Use strlcpy() instead of strlen() + memcpy() as suggested by
David.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: David Miller <davem@davemloft.net>


# 9acee9c5 18-Aug-2015 Tejun Heo <tj@kernel.org>

kernfs: implement kernfs_path_len()

Add a function to determine the path length of a kernfs node. This
for now will be used by writeback tracepoint updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# ea015218 13-May-2015 Eric W. Biederman <ebiederm@xmission.com>

kernfs: Add support for always empty directories.

Add a new function kernfs_create_empty_dir that can be used to create
directory that can not be modified.

Update the code to use make_empty_dir_inode when reporting a
permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# fb02915f 18-Jun-2015 Tejun Heo <tj@kernel.org>

kernfs: make kernfs_get_inode() public

Move kernfs_get_inode() prototype from fs/kernfs/kernfs-internal.h to
include/linux/kernfs.h. It obtains the matching inode for a
kernfs_node.

It will be used by cgroup for inode based permission checks for now
but is generally useful.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# dfeb0750 13-Feb-2015 Tejun Heo <tj@kernel.org>

kernfs: remove KERNFS_STATIC_NAME

When a new kernfs node is created, KERNFS_STATIC_NAME is used to avoid
making a separate copy of its name. It's currently only used for sysfs
attributes whose filenames are required to stay accessible and unchanged.
There are rare exceptions where these names are allocated and formatted
dynamically but for the vast majority of cases they're consts in the
rodata section.

Now that kernfs is converted to use kstrdup_const() and kfree_const(),
there's little point in keeping KERNFS_STATIC_NAME around. Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2b75869b 12-Oct-2014 NeilBrown <neilb@suse.de>

sysfs/kernfs: allow attributes to request write buffer be pre-allocated.

md/raid allows metadata management to be performed in user-space.
A various times, particularly on device failure, the metadata needs
to be updated before further writes can be permitted.
This means that the user-space program which updates metadata much
not block on writeout, and so must not allocate memory.

mlockall(MCL_CURRENT|MCL_FUTURE) and pre-allocation can avoid all
memory allocation issues for user-memory, but that does not help
kernel memory.
Several kernel objects can be pre-allocated. e.g. files opened before
any writes to the array are permitted.
However some kernel allocation happens in places that cannot be
pre-allocated.
In particular, writes to sysfs files (to tell md that it can now
allow writes to the array) allocate a buffer using GFP_KERNEL.

This patch allows attributes to be marked as "PREALLOC". In that case
the maximal buffer is allocated when the file is opened, and then used
on each write instead of allocating a new buffer.

As the same buffer is now shared for all writes on the same file
description, the mutex is extended to cover full use of the buffer
including the copy_from_user().

The new __ATTR_PREALLOC() 'or's a new flag in to the 'mode', which is
inspected by sysfs_add_file_mode_ns() to determine if the file should be
marked as requiring prealloc.

Despite the comment, we *do* use ->seq_show together with ->prealloc
in this patch. The next patch fixes that.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ecca47ce 01-Jul-2014 Tejun Heo <tj@kernel.org>

kernfs: kernfs_notify() must be useable from non-sleepable contexts

d911d9874801 ("kernfs: make kernfs_notify() trigger inotify events
too") added fsnotify triggering to kernfs_notify() which requires a
sleepable context. There are already existing users of
kernfs_notify() which invoke it from an atomic context and in general
it's silly to require a sleepable context for triggering a
notification.

The following is an invalid context bug triggerd by md invoking
sysfs_notify() from IO completion path.

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
2 locks held by swapper/1/0:
#0: (&(&vblk->vq_lock)->rlock){-.-...}, at: [<ffffffffa0039042>] virtblk_done+0x42/0xe0 [virtio_blk]
#1: (&(&bitmap->counts.lock)->rlock){-.....}, at: [<ffffffff81633718>] bitmap_endwrite+0x68/0x240
irq event stamp: 33518
hardirqs last enabled at (33515): [<ffffffff8102544f>] default_idle+0x1f/0x230
hardirqs last disabled at (33516): [<ffffffff818122ed>] common_interrupt+0x6d/0x72
softirqs last enabled at (33518): [<ffffffff810a1272>] _local_bh_enable+0x22/0x50
softirqs last disabled at (33517): [<ffffffff810a29e0>] irq_enter+0x60/0x80
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.16.0-0.rc2.git2.1.fc21.x86_64 #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
0000000000000000 f90db13964f4ee05 ffff88007d403b80 ffffffff81807b4c
0000000000000000 ffff88007d403ba8 ffffffff810d4f14 0000000000000000
0000000000441800 ffff880078fa1780 ffff88007d403c38 ffffffff8180caf2
Call Trace:
<IRQ> [<ffffffff81807b4c>] dump_stack+0x4d/0x66
[<ffffffff810d4f14>] __might_sleep+0x184/0x240
[<ffffffff8180caf2>] mutex_lock_nested+0x42/0x440
[<ffffffff812d76a0>] kernfs_notify+0x90/0x150
[<ffffffff8163377c>] bitmap_endwrite+0xcc/0x240
[<ffffffffa00de863>] close_write+0x93/0xb0 [raid1]
[<ffffffffa00df029>] r1_bio_write_done+0x29/0x50 [raid1]
[<ffffffffa00e0474>] raid1_end_write_request+0xe4/0x260 [raid1]
[<ffffffff813acb8b>] bio_endio+0x6b/0xa0
[<ffffffff813b46c4>] blk_update_request+0x94/0x420
[<ffffffff813bf0ea>] blk_mq_end_io+0x1a/0x70
[<ffffffffa00392c2>] virtblk_request_done+0x32/0x80 [virtio_blk]
[<ffffffff813c0648>] __blk_mq_complete_request+0x88/0x120
[<ffffffff813c070a>] blk_mq_complete_request+0x2a/0x30
[<ffffffffa0039066>] virtblk_done+0x66/0xe0 [virtio_blk]
[<ffffffffa002535a>] vring_interrupt+0x3a/0xa0 [virtio_ring]
[<ffffffff81116177>] handle_irq_event_percpu+0x77/0x340
[<ffffffff8111647d>] handle_irq_event+0x3d/0x60
[<ffffffff81119436>] handle_edge_irq+0x66/0x130
[<ffffffff8101c3e4>] handle_irq+0x84/0x150
[<ffffffff818146ad>] do_IRQ+0x4d/0xe0
[<ffffffff818122f2>] common_interrupt+0x72/0x72
<EOI> [<ffffffff8105f706>] ? native_safe_halt+0x6/0x10
[<ffffffff81025454>] default_idle+0x24/0x230
[<ffffffff81025f9f>] arch_cpu_idle+0xf/0x20
[<ffffffff810f5adc>] cpu_startup_entry+0x37c/0x7b0
[<ffffffff8104df1b>] start_secondary+0x25b/0x300

This patch fixes it by punting the notification delivery through a
work item. This ends up adding an extra pointer to kernfs_elem_attr
enlarging kernfs_node by a pointer, which is not ideal but not a very
big deal either. If this turns out to be an actual issue, we can move
kernfs_elem_attr->size to kernfs_node->iattr later.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 4e26445f 29-Jun-2014 Li Zefan <lizefan@huawei.com>

kernfs: introduce kernfs_pin_sb()

kernfs_pin_sb() tries to get a refcnt of the superblock.

This will be used by cgroupfs.

v2:
- make kernfs_pin_sb() return the superblock.
- drop kernfs_drop_sb().

tj: Updated the comment a bit.

[ This is a prerequisite for a bugfix. ]
Cc: <stable@vger.kernel.org> # 3.15
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# c9482a5b 26-Apr-2014 Jianyu Zhan <nasa4836@gmail.com>

kernfs: move the last knowledge of sysfs out from kernfs

There is still one residue of sysfs remaining: the sb_magic
SYSFS_MAGIC. However this should be kernfs user specific,
so this patch moves it out. Kerrnfs user should specify their
magic number while mouting.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 26fc9cd2 26-Apr-2014 Jianyu Zhan <nasa4836@gmail.com>

kernfs: move the last knowledge of sysfs out from kernfs

There is still one residue of sysfs remaining: the sb_magic
SYSFS_MAGIC. However this should be kernfs user specific,
so this patch moves it out. Kerrnfs user should specify their
magic number while mouting.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 555724a8 12-May-2014 Tejun Heo <tj@kernel.org>

kernfs, sysfs, cgroup: restrict extra perm check on open to sysfs

The kernfs open method - kernfs_fop_open() - inherited extra
permission checks from sysfs. While the vfs layer allows ignoring the
read/write permissions checks if the issuer has CAP_DAC_OVERRIDE,
sysfs explicitly denied open regardless of the cap if the file doesn't
have any of the UGO perms of the requested access or doesn't implement
the requested operation. It can be debated whether this was a good
idea or not but the behavior is too subtle and dangerous to change at
this point.

After cgroup got converted to kernfs, this extra perm check also got
applied to cgroup breaking libcgroup which opens write-only files with
O_RDWR as root. This patch gates the extra open permission check with
a new flag KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK and enables it for sysfs.
For sysfs, nothing changes. For cgroup, root now can perform any
operation regardless of the permissions as it was before kernfs
conversion. Note that kernfs still fails unimplemented operations
with -EINVAL.

While at it, add comments explaining KERNFS_ROOT flags.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrey Wagin <avagin@gmail.com>
Tested-by: Andrey Wagin <avagin@gmail.com>
Cc: Li Zefan <lizefan@huawei.com>
References: http://lkml.kernel.org/g/CANaxB-xUm3rJ-Cbp72q-rQJO5mZe1qK6qXsQM=vh0U8upJ44+A@mail.gmail.com
Fixes: 2bd59d48ebfb ("cgroup: convert to kernfs")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 7d568a83 09-Apr-2014 Tejun Heo <tj@kernel.org>

kernfs: implement kernfs_root->supers list

Currently, there's no way to find out which super_blocks are
associated with a given kernfs_root. Let's implement it - the planned
inotify extension to kernfs_notify() needs it.

Make kernfs_super_info point back to the super_block and chain it at
kernfs_root->supers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# b7ce40cf 04-Mar-2014 Tejun Heo <tj@kernel.org>

kernfs: cache atomic_write_len in kernfs_open_file

While implementing atomic_write_len, 4d3773c4bb41 ("kernfs: implement
kernfs_ops->atomic_write_len") moved data copy from userland inside
kernfs_get_active() and kernfs_open_file->mutex so that
kernfs_ops->atomic_write_len can be accessed before copying buffer
from userland; unfortunately, this could lead to locking order
inversion involving mmap_sem if copy_from_user() takes a page fault.

======================================================
[ INFO: possible circular locking dependency detected ]
3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26 Tainted: G W
-------------------------------------------------------
trinity-c236/10658 is trying to acquire lock:
(&of->mutex#2){+.+.+.}, at: [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120

but task is already holding lock:
(&mm->mmap_sem){++++++}, at: [<mm/util.c:397>] vm_mmap_pgoff+0x6e/0xe0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&mm->mmap_sem){++++++}:
[<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0
[<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0
[<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0
[<mm/memory.c:4188>] might_fault+0x7e/0xb0
[<arch/x86/include/asm/uaccess.h:713 fs/kernfs/file.c:291>] kernfs_fop_write+0xd8/0x190
[<fs/read_write.c:473>] vfs_write+0xe3/0x1d0
[<fs/read_write.c:523 fs/read_write.c:515>] SyS_write+0x5d/0xa0
[<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2

-> #0 (&of->mutex#2){+.+.+.}:
[<kernel/locking/lockdep.c:1840>] check_prev_add+0x13f/0x560
[<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0
[<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0
[<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0
[<kernel/locking/mutex.c:470 kernel/locking/mutex.c:571>] mutex_lock_nested+0x6a/0x510
[<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120
[<mm/mmap.c:1573>] mmap_region+0x310/0x5c0
[<mm/mmap.c:1365>] do_mmap_pgoff+0x385/0x430
[<mm/util.c:399>] vm_mmap_pgoff+0x8f/0xe0
[<mm/mmap.c:1416 mm/mmap.c:1374>] SyS_mmap_pgoff+0x1b0/0x210
[<arch/x86/kernel/sys_x86_64.c:72>] SyS_mmap+0x1d/0x20
[<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&mm->mmap_sem);
lock(&of->mutex#2);
lock(&mm->mmap_sem);
lock(&of->mutex#2);

*** DEADLOCK ***

1 lock held by trinity-c236/10658:
#0: (&mm->mmap_sem){++++++}, at: [<mm/util.c:397>] vm_mmap_pgoff+0x6e/0xe0

stack backtrace:
CPU: 2 PID: 10658 Comm: trinity-c236 Tainted: G W 3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26
0000000000000000 ffff88011911fa48 ffffffff8438e945 0000000000000000
0000000000000000 ffff88011911fa98 ffffffff811a0109 ffff88011911fab8
ffff88011911fab8 ffff88011911fa98 ffff880119128cc0 ffff880119128cf8
Call Trace:
[<lib/dump_stack.c:52>] dump_stack+0x52/0x7f
[<kernel/locking/lockdep.c:1213>] print_circular_bug+0x129/0x160
[<kernel/locking/lockdep.c:1840>] check_prev_add+0x13f/0x560
[<include/linux/spinlock.h:343 mm/slub.c:1933>] ? deactivate_slab+0x511/0x550
[<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0
[<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0
[<mm/mmap.c:1552>] ? mmap_region+0x24a/0x5c0
[<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0
[<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120
[<kernel/locking/mutex.c:470 kernel/locking/mutex.c:571>] mutex_lock_nested+0x6a/0x510
[<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120
[<kernel/sched/core.c:2477>] ? get_parent_ip+0x11/0x50
[<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120
[<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120
[<mm/mmap.c:1573>] mmap_region+0x310/0x5c0
[<mm/mmap.c:1365>] do_mmap_pgoff+0x385/0x430
[<mm/util.c:397>] ? vm_mmap_pgoff+0x6e/0xe0
[<mm/util.c:399>] vm_mmap_pgoff+0x8f/0xe0
[<kernel/rcu/update.c:97>] ? __rcu_read_unlock+0x44/0xb0
[<fs/file.c:641>] ? dup_fd+0x3c0/0x3c0
[<mm/mmap.c:1416 mm/mmap.c:1374>] SyS_mmap_pgoff+0x1b0/0x210
[<arch/x86/kernel/sys_x86_64.c:72>] SyS_mmap+0x1d/0x20
[<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2

Fix it by caching atomic_write_len in kernfs_open_file during open so
that it can be determined without accessing kernfs_ops in
kernfs_fop_write(). This restores the structure of kernfs_fop_write()
before 4d3773c4bb41 with updated @len determination logic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
References: http://lkml.kernel.org/g/53113485.2090407@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# fed95bab 25-Feb-2014 Li Zefan <lizefan@huawei.com>

sysfs: fix namespace refcnt leak

As mount() and kill_sb() is not a one-to-one match, we shoudn't get
ns refcnt unconditionally in sysfs_mount(), and instead we should
get the refcnt only when kernfs_mount() allocated a new superblock.

v2:
- Changed the name of the new argument, suggested by Tejun.
- Made the argument optional, suggested by Tejun.

v3:
- Make the new argument as second-to-last arg, suggested by Tejun.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
fs/kernfs/mount.c | 8 +++++++-
fs/sysfs/mount.c | 5 +++--
include/linux/kernfs.h | 9 +++++----
3 files changed, 15 insertions(+), 7 deletions(-)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ba341d55 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: add CONFIG_KERNFS

As sysfs was kernfs's only user, kernfs has been piggybacking on
CONFIG_SYSFS; however, kernfs is scheduled to grow a new user very
soon. Introduce a separate config option CONFIG_KERNFS which is to be
selected by kernfs users.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 3eef34ad 07-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: implement kernfs_get_parent(), kernfs_name/path() and friends

kernfs_node->parent and ->name are currently marked as "published"
indicating that kernfs users may access them directly; however, those
fields may get updated by kernfs_rename[_ns]() and unrestricted access
may lead to erroneous values or oops.

Protect ->parent and ->name updates with a irq-safe spinlock
kernfs_rename_lock and implement the following accessors for these
fields.

* kernfs_name() - format the node's name into the specified buffer
* kernfs_path() - format the node's path into the specified buffer
* pr_cont_kernfs_name() - pr_cont a node's name (doesn't need buffer)
* pr_cont_kernfs_path() - pr_cont a node's path (doesn't need buffer)
* kernfs_get_parent() - pin and return a node's parent

All can be called under any context. The recursive sysfs_pathname()
in fs/sysfs/dir.c is replaced with kernfs_path() and
sysfs_rename_dir_ns() is updated to use kernfs_get_parent() instead of
dereferencing parent directly.

v2: Dummy definition of kernfs_path() for !CONFIG_KERNFS was missing
static inline making it cause a lot of build warnings. Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 0c23b225 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: implement kernfs_node_from_dentry(), kernfs_root_from_sb() and kernfs_rename()

Implement helpers to determine node from dentry and root from
super_block. Also add a kernfs_rename_ns() wrapper which assumes NULL
namespace. These generally make sense and will be used by cgroup.

v2: Some dummy implementations for !CONFIG_SYSFS was missing. Fixed.
Reported by kbuild test robot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 2536390d 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: add kernfs_open_file->priv

Add a private data field to be used by kernfs file operations. This
generally makes sense and will be used by cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 4d3773c4 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: implement kernfs_ops->atomic_write_len

A write to a kernfs_node is buffered through a kernel buffer. Writes
<= PAGE_SIZE are performed atomically, while larger ones are executed
in PAGE_SIZE chunks. While this is enough for sysfs, cgroup which is
scheduled to be converted to use kernfs needs a bit more control over
it.

This patch adds kernfs_ops->atomic_write_len. If not set (zero), the
behavior stays the same. If set, writes upto the size are executed
atomically and larger writes are rejected with -E2BIG.

A different implementation strategy would be allowing configuring
chunking size while making the original write size available to the
write method; however, such strategy, while being more complicated,
doesn't really buy anything. If the write implementation has to
handle chunking, the specific chunk size shouldn't matter all that
much.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# d35258ef 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: allow nodes to be created in the deactivated state

Currently, kernfs_nodes are made visible to userland on creation,
which makes it difficult for kernfs users to atomically succeed or
fail creation of multiple nodes. In addition, if something fails
after creating some nodes, the created nodes might already be in use
and their active refs need to be drained for removal, which has the
potential to introduce tricky reverse locking dependency on active_ref
depending on how the error path is synchronized.

This patch introduces per-root flag KERNFS_ROOT_CREATE_DEACTIVATED.
If set, all nodes under the root are created in the deactivated state
and stay invisible to userland until explicitly enabled by the new
kernfs_activate() API. Also, nodes which have never been activated
are guaranteed to bypass draining on removal thus allowing error paths
to not worry about lockding dependency on active_ref draining.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 6a7fed4e 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: implement kernfs_syscall_ops->remount_fs() and ->show_options()

Add two super_block related syscall callbacks ->remount_fs() and
->show_options() to kernfs_syscall_ops. These simply forward the
matching super_operations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 90c07c89 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: rename kernfs_dir_ops to kernfs_syscall_ops

We're gonna need non-dir syscall callbacks, which will make dir_ops a
misnomer. Let's rename kernfs_dir_ops to kernfs_syscall_ops.

This is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 07c7530d 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: invoke dir_ops while holding active ref of the target node

kernfs_dir_ops are currently being invoked without any active
reference, which makes it tricky for the invoked operations to
determine whether the objects associated those nodes are safe to
access and will remain that way for the duration of such operations.

kernfs already has active_ref mechanism to deal with this which makes
the removal of a given node the synchronization point for gating the
file operations. There's no reason for dir_ops to be any different.
Update the dir_ops handling so that active_ref is held while the
dir_ops are executing. This guarantees that while a dir_ops is
executing the target nodes stay alive.

As kernfs_dir_ops doesn't have any in-kernel user at this point, this
doesn't affect anybody.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 6b0afc2a 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers

Sometimes it's necessary to implement a node which wants to delete
nodes including itself. This isn't straightforward because of kernfs
active reference. While a file operation is in progress, an active
reference is held and kernfs_remove() waits for all such references to
drain before completing. For a self-deleting node, this is a deadlock
as kernfs_remove() ends up waiting for an active reference that itself
is sitting on top of.

This currently is worked around in the sysfs layer using
sysfs_schedule_callback() which makes such removals asynchronous.
While it works, it's rather cumbersome and inherently breaks
synchronicity of the operation - the file operation which triggered
the operation may complete before the removal is finished (or even
started) and the removal may fail asynchronously. If a removal
operation is immmediately followed by another operation which expects
the specific name to be available (e.g. removal followed by rename
onto the same name), there's no way to make the latter operation
reliable.

The thing is there's no inherent reason for this to be asynchrnous.
All that's necessary to do this synchronous is a dedicated operation
which drops its own active ref and deactivates self. This patch
implements kernfs_remove_self() and its wrappers in sysfs and driver
core. kernfs_remove_self() is to be called from one of the file
operations, drops the active ref the task is holding, removes the self
node, and restores active ref to the dead node so that the ref is
balanced afterwards. __kernfs_remove() is updated so that it takes an
early exit if the target node is already fully removed so that the
active ref restored by kernfs_remove_self() after removal doesn't
confuse the deactivation path.

This makes implementing self-deleting nodes very easy. The normal
removal path doesn't even need to be changed to use
kernfs_remove_self() for the self-deleting node. The method can
invoke kernfs_remove_self() on itself before proceeding the normal
removal path. kernfs_remove() invoked on the node by the normal
deletion path will simply be ignored.

This will replace sysfs_schedule_callback(). A subtle feature of
sysfs_schedule_callback() is that it collapses multiple invocations -
even if multiple removals are triggered, the removal callback is run
only once. An equivalent effect can be achieved by testing the return
value of kernfs_remove_self() - only the one which gets %true return
value should proceed with actual deletion. All other instances of
kernfs_remove_self() will wait till the enclosing kernfs operation
which invoked the winning instance of kernfs_remove_self() finishes
and then return %false. This trivially makes all users of
kernfs_remove_self() automatically show correct synchronous behavior
even when there are multiple concurrent operations - all "echo 1 >
delete" instances will finish only after the whole operation is
completed by one of the instances.

Note that manipulation of active ref is implemented in separate public
functions - kernfs_[un]break_active_protection().
kernfs_remove_self() is the only user at the moment but this will be
used to cater to more complex cases.

v2: For !CONFIG_SYSFS, dummy version kernfs_remove_self() was missing
and sysfs_remove_file_self() had incorrect return type. Fix it.
Reported by kbuild test bot.

v3: kernfs_[un]break_active_protection() separated out from
kernfs_remove_self() and exposed as public API.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 81c173cb 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: remove KERNFS_REMOVED

KERNFS_REMOVED is used to mark half-initialized and dying nodes so
that they don't show up in lookups and deny adding new nodes under or
renaming it; however, its role overlaps that of deactivation.

It's necessary to deny addition of new children while removal is in
progress; however, this role considerably intersects with deactivation
- KERNFS_REMOVED prevents new children while deactivation prevents new
file operations. There's no reason to have them separate making
things more complex than necessary.

This patch removes KERNFS_REMOVED.

* Instead of KERNFS_REMOVED, each node now starts its life
deactivated. This means that we now use both atomic_add() and
atomic_sub() on KN_DEACTIVATED_BIAS, which is INT_MIN. The compiler
generates an overflow warnings when negating INT_MIN as the negation
can't be represented as a positive number. Nothing is actually
broken but let's bump BIAS by one to avoid the warnings for archs
which negates the subtrahend..

* A new helper kernfs_active() which tests whether kn->active >= 0 is
added for convenience and lockdep annotation. All KERNFS_REMOVED
tests are replaced with negated kernfs_active() tests.

* __kernfs_remove() is updated to deactivate, but not drain, all nodes
in the subtree instead of setting KERNFS_REMOVED. This removes
deactivation from kernfs_deactivate(), which is now renamed to
kernfs_drain().

* Sanity check on KERNFS_REMOVED in kernfs_put() is replaced with
checks on the active ref.

* Some comment style updates in the affected area.

v2: Reordered before removal path restructuring. kernfs_active()
dropped and kernfs_get/put_active() used instead. RB_EMPTY_NODE()
used in the lookup paths.

v3: Reverted most of v2 except for creating a new node with
KN_DEACTIVATED_BIAS.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 182fd64b 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()

There currently are two mechanisms gating active ref lockdep
annotations - KERNFS_LOCKDEP flag and KERNFS_ACTIVE_REF type mask.
The former disables lockdep annotations in kernfs_get/put_active()
while the latter disables all of kernfs_deactivate().

While KERNFS_ACTIVE_REF also behaves as an optimization to skip the
deactivation step for non-file nodes, the benefit is marginal and it
needlessly diverges code paths. Let's drop KERNFS_ACTIVE_REF.

While at it, add a test helper kernfs_lockdep() to test KERNFS_LOCKDEP
flag so that it's more convenient and the related code can be compiled
out when not enabled.

v2: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
KERNFS_LOCKDEP flag"). As the earlier patch already added
KERNFS_LOCKDEP tests to kernfs_deactivate(), those additions are
dropped from this patch and the existing ones are simply converted
to kernfs_lockdep().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 988cd7af 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: remove kernfs_addrm_cxt

kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were
added because there were operations which should be performed outside
kernfs_mutex after adding and removing kernfs_nodes. The necessary
operations were recorded in kernfs_addrm_cxt and performed by
kernfs_addrm_finish(); however, after the recent changes which
relocated deactivation and unmapping so that they're performed
directly during removal, the only operation kernfs_addrm_finish()
performs is kernfs_put(), which can be moved inside the removal path
too.

This patch moves the kernfs_put() of the base ref to __kernfs_remove()
and remove kernfs_addrm_cxt and kernfs_addrm_start/finish().

* kernfs_add_one() is updated to grab and release kernfs_mutex itself.
sysfs_addrm_start/finish() invocations around it are removed from
all users.

* __kernfs_remove() puts an unlinked node directly instead of chaining
it to kernfs_addrm_cxt. Its callers are updated to grab and release
kernfs_mutex instead of calling kernfs_addrm_start/finish() around
it.

v2: Rebased on top of "kernfs: associate a new kernfs_node with its
parent on creation" which dropped @parent from kernfs_add_one().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# abd54f02 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq

kernfs_node->u.completion is used to notify deactivation completion
from kernfs_put_active() to kernfs_deactivate(). We now allow
multiple racing removals of the same node and the current removal
scheme is no longer correct - kernfs_remove() invocation may return
before the node is properly deactivated if it races against another
removal. The removal path will be restructured to address the issue.

To help such restructure which requires supporting multiple waiters,
this patch replaces kernfs_node->u.completion with
kernfs_root->deactivate_waitq. This makes deactivation event
notifications share a per-root waitqueue_head; however, the wait path
is quite cold and this will also allow shaving one pointer off
kernfs_node.

v2: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
KERNFS_LOCKDEP flag").

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 917f56ca 17-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: add struct dentry declaration in kernfs.h

Hello, Greg.

Two misc fixes for kernfs.

Thanks.
------- 8< -------
struct dentry is used in kernfs.h but its declaration was missing,
leading to compilation errors unless its declaration gets pulled in in
some other way. Add the declaration.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 87da1493 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq"

This reverts commit ea1c472dfeada211a0100daa7976e8e8e779b858.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 0890147f 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()"

This reverts commit a69d001cfc712b96ec9d7ba44d6285702a38dabf.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 798c75a0 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: remove KERNFS_REMOVED"

This reverts commit ae34372eb8408b3d07e870f1939f99007a730d28.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 4f4b1b64 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: restructure removal path to fix possible premature return"

This reverts commit 45a140e587f3d32d8d424ed940dffb61e1739047.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 7653fe9d 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: remove kernfs_addrm_cxt"

This reverts commit 99177a34110889a8f2c36420c34e3bcc9bfd8a70.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 9b0925a6 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: implement kernfs_{de|re}activate[_self]()"

This reverts commit 9f010c2ad5194a4b682e747984477850fabd03be.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# a9f138b0 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers"

This reverts commit 1ae06819c77cff1ea2833c94f8c093fe8a5c79db.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 1ae06819 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers

Sometimes it's necessary to implement a node which wants to delete
nodes including itself. This isn't straightforward because of kernfs
active reference. While a file operation is in progress, an active
reference is held and kernfs_remove() waits for all such references to
drain before completing. For a self-deleting node, this is a deadlock
as kernfs_remove() ends up waiting for an active reference that itself
is sitting on top of.

This currently is worked around in the sysfs layer using
sysfs_schedule_callback() which makes such removals asynchronous.
While it works, it's rather cumbersome and inherently breaks
synchronicity of the operation - the file operation which triggered
the operation may complete before the removal is finished (or even
started) and the removal may fail asynchronously. If a removal
operation is immmediately followed by another operation which expects
the specific name to be available (e.g. removal followed by rename
onto the same name), there's no way to make the latter operation
reliable.

The thing is there's no inherent reason for this to be asynchrnous.
All that's necessary to do this synchronous is a dedicated operation
which drops its own active ref and deactivates self. This patch
implements kernfs_remove_self() and its wrappers in sysfs and driver
core. kernfs_remove_self() is to be called from one of the file
operations, drops the active ref and deactivates using
__kernfs_deactivate_self(), removes the self node, and restores active
ref to the dead node using __kernfs_reactivate_self() so that the ref
is balanced afterwards. __kernfs_remove() is updated so that it takes
an early exit if the target node is already fully removed so that the
active ref restored by kernfs_remove_self() after removal doesn't
confuse the deactivation path.

This makes implementing self-deleting nodes very easy. The normal
removal path doesn't even need to be changed to use
kernfs_remove_self() for the self-deleting node. The method can
invoke kernfs_remove_self() on itself before proceeding the normal
removal path. kernfs_remove() invoked on the node by the normal
deletion path will simply be ignored.

This will replace sysfs_schedule_callback(). A subtle feature of
sysfs_schedule_callback() is that it collapses multiple invocations -
even if multiple removals are triggered, the removal callback is run
only once. An equivalent effect can be achieved by testing the return
value of kernfs_remove_self() - only the one which gets %true return
value should proceed with actual deletion. All other instances of
kernfs_remove_self() will wait till the enclosing kernfs operation
which invoked the winning instance of kernfs_remove_self() finishes
and then return %false. This trivially makes all users of
kernfs_remove_self() automatically show correct synchronous behavior
even when there are multiple concurrent operations - all "echo 1 >
delete" instances will finish only after the whole operation is
completed by one of the instances.

v2: For !CONFIG_SYSFS, dummy version kernfs_remove_self() was missing
and sysfs_remove_file_self() had incorrect return type. Fix it.
Reported by kbuild test bot.

v3: Updated to use __kernfs_{de|re}activate_self().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 9f010c2a 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: implement kernfs_{de|re}activate[_self]()

This patch implements four functions to manipulate deactivation state
- deactivate, reactivate and the _self suffixed pair. A new fields
kernfs_node->deact_depth is added so that concurrent and nested
deactivations are handled properly. kernfs_node->hash is moved so
that it's paired with the new field so that it doesn't increase the
size of kernfs_node.

A kernfs user's lock would normally nest inside active ref but during
removal the user may want to perform kernfs_remove() while holding the
said lock, which would introduce a reverse locking dependency. This
function can be used to break such reverse dependency by allowing
deactivation step to performed separately outside user's critical
section.

This will also be used implement kernfs_remove_self().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 99177a34 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: remove kernfs_addrm_cxt

kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were
added because there were operations which should be performed outside
kernfs_mutex after adding and removing kernfs_nodes. The necessary
operations were recorded in kernfs_addrm_cxt and performed by
kernfs_addrm_finish(); however, after the recent changes which
relocated deactivation and unmapping so that they're performed
directly during removal, the only operation kernfs_addrm_finish()
performs is kernfs_put(), which can be moved inside the removal path
too.

This patch moves the kernfs_put() of the base ref to __kernfs_remove()
and remove kernfs_addrm_cxt and kernfs_addrm_start/finish().

* kernfs_add_one() is updated to grab and release the parent's active
ref and kernfs_mutex itself. kernfs_get/put_active() and
kernfs_addrm_start/finish() invocations around it are removed from
all users.

* __kernfs_remove() puts an unlinked node directly instead of chaining
it to kernfs_addrm_cxt. Its callers are updated to grab and release
kernfs_mutex instead of calling kernfs_addrm_start/finish() around
it.

v2: Updated to fit the v2 restructuring of removal path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 45a140e5 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: restructure removal path to fix possible premature return

The recursive nature of kernfs_remove() means that, even if
kernfs_remove() is not allowed to be called multiple times on the same
node, there may be race conditions between removal of parent and its
descendants. While we can claim that kernfs_remove() shouldn't be
called on one of the descendants while the removal of an ancestor is
in progress, such rule is unnecessarily restrictive and very difficult
to enforce. It's better to simply allow invoking kernfs_remove() as
the caller sees fit as long as the caller ensures that the node is
accessible.

The current behavior in such situations is broken. Whoever enters
removal path first takes the node off the hierarchy and then
deactivates. Following removers either return as soon as it notices
that it's not the first one or can't even find the target node as it
has already been removed from the hierarchy. In both cases, the
following removers may finish prematurely while the nodes which should
be removed and drained are still being processed by the first one.

This patch restructures so that multiple removers, whether through
recursion or direction invocation, always follow the following rules.

* When there are multiple concurrent removers, only one puts the base
ref.

* Regardless of which one puts the base ref, all removers are blocked
until the target node is fully deactivated and removed.

To achieve the above, removal path now first deactivates the subtree,
drains it and then unlinks one-by-one. __kernfs_deactivate() is
called directly from __kernfs_removal() and drops and regrabs
kernfs_mutex for each descendant to drain active refs. As this means
that multiple removers can enter __kernfs_deactivate() for the same
node, the function is updated so that it can handle multiple
deactivators of the same node - only one actually deactivates but all
wait till drain completion.

The restructured removal path guarantees that a removed node gets
unlinked only after the node is deactivated and drained. Combined
with proper multiple deactivator handling, this guarantees that any
invocation of kernfs_remove() returns only after the node itself and
all its descendants are deactivated, drained and removed.

v2: Draining separated into a separate loop (used to be in the same
loop as unlink) and done from __kernfs_deactivate(). This is to
allow exposing deactivation as a separate interface later.

Root node removal was broken in v1 patch. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ae34372e 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: remove KERNFS_REMOVED

KERNFS_REMOVED is used to mark half-initialized and dying nodes so
that they don't show up in lookups and deny adding new nodes under or
renaming it; however, its role overlaps those of deactivation and
removal from rbtree.

It's necessary to deny addition of new children while removal is in
progress; however, this role considerably intersects with deactivation
- KERNFS_REMOVED prevents new children while deactivation prevents new
file operations. There's no reason to have them separate making
things more complex than necessary.

KERNFS_REMOVED is also used to decide whether a node is still visible
to vfs layer, which is rather redundant as equivalent determination
can be made by testing whether the node is on its parent's children
rbtree or not.

This patch removes KERNFS_REMOVED.

* Instead of KERNFS_REMOVED, each node now starts its life
deactivated. This means that we now use both atomic_add() and
atomic_sub() on KN_DEACTIVATED_BIAS, which is INT_MIN. The compiler
generates an overflow warnings when negating INT_MIN as the negation
can't be represented as a positive number. Nothing is actually
broken but let's bump BIAS by one to avoid the warnings for archs
which negates the subtrahend..

* KERNFS_REMOVED tests in add and rename paths are replaced with
kernfs_get/put_active() of the target nodes. Due to the way the add
path is structured now, active ref handling is done in the callers
of kernfs_add_one(). This will be consolidated up later.

* kernfs_remove_one() is updated to deactivate instead of setting
KERNFS_REMOVED. This removes deactivation from kernfs_deactivate(),
which is now renamed to kernfs_drain().

* kernfs_dop_revalidate() now tests RB_EMPTY_NODE(&kn->rb) instead of
KERNFS_REMOVED and KERNFS_REMOVED test in kernfs_dir_pos() is
dropped. A node which is removed from the children rbtree is not
included in the iteration in the first place. This means that a
node may be visible through vfs a bit longer - it's now also visible
after deactivation until the actual removal. This slightly enlarged
window difference doesn't make any difference to the userland.

* Sanity check on KERNFS_REMOVED in kernfs_put() is replaced with
checks on the active ref.

* Some comment style updates in the affected area.

v2: Reordered before removal path restructuring. kernfs_active()
dropped and kernfs_get/put_active() used instead. RB_EMPTY_NODE()
used in the lookup paths.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# a69d001c 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()

There currently are two mechanisms gating active ref lockdep
annotations - KERNFS_LOCKDEP flag and KERNFS_ACTIVE_REF type mask.
The former disables lockdep annotations in kernfs_get/put_active()
while the latter disables all of kernfs_deactivate().

While KERNFS_ACTIVE_REF also behaves as an optimization to skip the
deactivation step for non-file nodes, the benefit is marginal and it
needlessly diverges code paths. Let's drop KERNFS_ACTIVE_REF and use
KERNFS_LOCKDEP in kernfs_deactivate() too.

While at it, add a test helper kernfs_lockdep() to test KERNFS_LOCKDEP
flag so that it's more convenient and the related code can be compiled
out when not enabled.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ea1c472d 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq

kernfs_node->u.completion is used to notify deactivation completion
from kernfs_put_active() to kernfs_deactivate(). We now allow
multiple racing removals of the same node and the current removal
scheme is no longer correct - kernfs_remove() invocation may return
before the node is properly deactivated if it races against another
removal. The removal path will be restructured to address the issue.

To help such restructure which requires supporting multiple waiters,
this patch replaces kernfs_node->u.completion with
kernfs_root->deactivate_waitq. This makes deactivation event
notifications share a per-root waitqueue_head; however, the wait path
is quite cold and this will also allow shaving one pointer off
kernfs_node.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 80b9bbef 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: add kernfs_dir_ops

Add support for mkdir(2), rmdir(2) and rename(2) syscalls. This is
implemented through optional kernfs_dir_ops callback table which can
be specified on kernfs_create_root(). An implemented callback is
invoked when the matching syscall is invoked.

As kernfs keep dcache syncs with internal representation and
revalidates dentries on each access, the implementation of these
methods is extremely simple. Each just discovers the relevant
kernfs_node(s) and invokes the requested callback which is allowed to
do any kernfs operations and the end result doesn't necessarily have
to match the expected semantics of the syscall.

This will be used to convert cgroup to use kernfs instead of its own
filesystem implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 2063d608 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: mark static names with KERNFS_STATIC_NAME

Because sysfs used struct attribute which are supposed to stay
constant, sysfs didn't copy names when creating regular files. The
specified string for name was supposed to stay constant. Such
distinction isn't inherent for kernfs. kernfs_create_file[_ns]()
should be able to take the same @name as kernfs_create_dir[_ns]()

As there can be huge number of sysfs attributes, we still want to be
able to use static names for sysfs attributes. This patch renames
kernfs_create_file_ns_key() to __kernfs_create_file() and adds
@name_is_static parameter so that the caller can explicitly indicate
that @name can be used without copying. kernfs is updated to use
KERNFS_STATIC_NAME to distinguish static and copied names.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# bb8b9d09 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: add @mode to kernfs_create_dir[_ns]()

sysfs assumed 0755 for all newly created directories and kernfs
inherited it. This assumption is unnecessarily restrictive and
inconsistent with kernfs_create_file[_ns](). This patch adds @mode
parameter to kernfs_create_dir[_ns]() and update uses in sysfs
accordingly. Among others, this will be useful for implementations of
the planned ->mkdir() method.

This patch doesn't introduce any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# df23fc39 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: s/sysfs/kernfs/ in constants

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

This patch performs the following renames.

* s/SYSFS_DIR/KERNFS_DIR/
* s/SYSFS_KOBJ_ATTR/KERNFS_FILE/
* s/SYSFS_KOBJ_LINK/KERNFS_LINK/
* s/SYSFS_{TYPE_FLAGS}/KERNFS_{TYPE_FLAGS}/
* s/SYSFS_FLAG_{FLAG}/KERNFS_{FLAG}/
* s/sysfs_type()/kernfs_type()/
* s/SD_DEACTIVATED_BIAS/KN_DEACTIVATED_BIAS/

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# c525aadd 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: s/sysfs/kernfs/ in various data structures

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

This patch performs the following renames.

* s/sysfs_open_dirent/kernfs_open_node/
* s/sysfs_open_file/kernfs_open_file/
* s/sysfs_inode_attrs/kernfs_iattrs/
* s/sysfs_addrm_cxt/kernfs_addrm_cxt/
* s/sysfs_super_info/kernfs_super_info/
* s/sysfs_info()/kernfs_info()/
* s/sysfs_open_dirent_lock/kernfs_open_node_lock/
* s/sysfs_open_file_mutex/kernfs_open_file_mutex/
* s/sysfs_of()/kernfs_of()/

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# adc5e8b5 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: drop s_ prefix from kernfs_node members

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

s_ prefix for kernfs members is used inconsistently and a misnomer
now. It's not like kernfs_node is used widely across the kernel
making the ability to grep for the members particularly useful. Let's
just drop the prefix.

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 324a56e1 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: s/sysfs_dirent/kernfs_node/ and rename its friends accordingly

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

This patch performs the following renames.

* s/sysfs_elem_dir/kernfs_elem_dir/
* s/sysfs_elem_symlink/kernfs_elem_symlink/
* s/sysfs_elem_attr/kernfs_elem_file/
* s/sysfs_dirent/kernfs_node/
* s/sd/kn/ in kernfs proper
* s/parent_sd/parent/
* s/target_sd/target/
* s/dir_sd/parent/
* s/to_sysfs_dirent()/rb_to_kn()/
* misc renames of local vars when they conflict with the above

Because md, mic and gpio dig into sysfs details, this patch ends up
modifying them. All are sysfs_dirent renames and trivial. While we
can avoid these by introducing a dummy wrapping struct sysfs_dirent
around kernfs_node, given the limited usage outside kernfs and sysfs
proper, I don't think such workaround is called for.

This patch is strictly rename only and doesn't introduce any
functional difference.

- mic / gpio renames were missing. Spotted by kbuild test robot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ac9bba03 29-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: implement kernfs_ns_enabled()

fs/sysfs/symlink.c::sysfs_delete_link() tests @sd->s_flags for
SYSFS_FLAG_NS. Let's add kernfs_ns_enabled() so that sysfs doesn't
have to test sysfs_dirent flag directly. This makes things tidier for
kernfs proper too.

This is purely cosmetic.

v2: To avoid possible NULL deref, use noop dummy implementation which
always returns false when !CONFIG_SYSFS.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# cf9e5a73 29-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: make sysfs_dirent definition public

sysfs_dirent includes some information which should be available to
kernfs users - the type, flags, name and parent pointer. This patch
moves sysfs_dirent definition from kernfs/kernfs-internal.h to
include/linux/kernfs.h so that kernfs users can access them.

The type part of flags is exported as enum kernfs_node_type, the flags
kernfs_node_flag, sysfs_type() and kernfs_enable_ns() are moved to
include/linux/kernfs.h and the former is updated to return the enum
type. sysfs_dirent->s_parent and ->s_name are marked explicitly as
public.

This patch doesn't introduce any functional changes.

v2: Flags exported too and kernfs_enable_ns() definition moved.

v3: While moving kernfs_enable_ns() to include/linux/kernfs.h, v1 and
v2 put the definition outside CONFIG_SYSFS replacing the dummy
implementation with the actual implementation too. Unfortunately,
this can lead to oops when !CONFIG_SYSFS because
kernfs_enable_ns() may be called on a NULL @sd and now tries to
dereference @sd instead of not doing anything. This issue was
reported by Yuanhan Liu.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 4b93dc9b 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: prepare mount path for kernfs

We're in the process of separating out core sysfs functionality into
kernfs which will deal with sysfs_dirents directly. This patch
rearranges mount path so that the kernfs and sysfs parts are separate.

* As sysfs_super_info won't be visible outside kernfs proper,
kernfs_super_ns() is added to allow kernfs users to access a
super_block's namespace tag.

* Generic mount operation is separated out into kernfs_mount_ns().
sysfs_mount() now just performs sysfs-specific permission check,
acquires namespace tag, and invokes kernfs_mount_ns().

* Generic superblock release is separated out into kernfs_kill_sb()
which can be used directly as file_system_type->kill_sb(). As sysfs
needs to put the namespace tag, sysfs_kill_sb() wraps
kernfs_kill_sb() with ns tag put.

* sysfs_dir_cachep init and sysfs_inode_init() are separated out into
kernfs_init(). kernfs_init() uses only small amount of memory and
trying to handle and propagate kernfs_init() failure doesn't make
much sense. Use SLAB_PANIC for sysfs_dir_cachep and make
sysfs_inode_init() panic on failure.

After this change, kernfs_init() should be called before
sysfs_init(), fs/namespace.c::mnt_init() modified accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# bc755553 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: make inode number ida per kernfs_root

kernfs is being updated to allow multiple sysfs_dirent hierarchies so
that it can also be used by other users. Currently, inode number is
allocated using a global ida, sysfs_ino_ida; however, inos for
different hierarchies should be handled separately.

This patch makes ino allocation per kernfs_root. sysfs_ino_ida is
replaced by kernfs_root->ino_ida and sysfs_new_dirent() is updated to
take @root and allocate ino from it. ida_simple_get/remove() are used
instead of sysfs_ino_lock and sysfs_alloc/free_ino().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ba7443bc 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: implement kernfs_create/destroy_root()

There currently is single kernfs hierarchy in the whole system which
is used for sysfs. kernfs needs to support multiple hierarchies to
allow other users. This patch introduces struct kernfs_root which
serves as the root of each kernfs hierarchy and implements
kernfs_create/destroy_root().

* Each kernfs_root is associated with a root sd (sysfs_dentry). The
root is freed when the root sd is released and kernfs_destory_root()
simply invokes kernfs_remove() on the root sd. sysfs_remove_one()
is updated to handle release of the root sd. Note that ps_iattr
update in sysfs_remove_one() is trivially updated for readability.

* Root sd's are now dynamically allocated using sysfs_new_dirent().
Update sysfs_alloc_ino() so that it gives out ino from 1 so that the
root sd still gets ino 1.

* While kernfs currently only points to the root sd, it'll soon grow
fields which are specific to each hierarchy. As determining a given
sd's root will be necessary, sd->s_dir.root is added. This backlink
fits better as a separate field in sd; however, sd->s_dir is inside
union with space to spare, so use it to save space and provide
kernfs_root() accessor to determine the root sd.

* As hierarchies may be destroyed now, each mount needs to hold onto
the hierarchy it's attached to. Update sysfs_fill_super() and
sysfs_kill_sb() so that they get and put the kernfs_root
respectively.

* sysfs_root is replaced with kernfs_root which is dynamically created
by invoking kernfs_create_root() from sysfs_init().

This patch doesn't introduce any visible behavior changes.

v2: kernfs_create_root() forgot to set @sd->priv. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ccf73cf3 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: introduce kernfs[_find_and]_get() and kernfs_put()

Introduce kernfs interface for finding, getting and putting
sysfs_dirents.

* sysfs_find_dirent() is renamed to kernfs_find_ns() and lockdep
assertion for sysfs_mutex is added.

* sysfs_get_dirent_ns() is renamed to kernfs_find_and_get().

* Macro inline dancing around __sysfs_get/put() are removed and
kernfs_get/put() are made proper functions implemented in
fs/sysfs/dir.c.

While the conversions are mostly equivalent, there's one difference -
kernfs_get() doesn't return the input param as its return value. This
change is intentional. While passing through the input increases
writability in some areas, it is unnecessary and has been shown to
cause confusion regarding how the last ref is handled.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 517e64f5 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: revamp sysfs_dirent active_ref lockdep annotation

Currently, sysfs_dirent active_ref lockdep annotation uses
attribute->[s]key as the lockdep key, which forces
kernfs_create_file_ns() to assume that sysfs_dirent->priv is pointing
to a struct attribute which may not be true for non-sysfs users. This
patch restructures the lockdep annotation such that

* kernfs_ops contains lockdep_key which is used by default for files
created kernfs_create_file_ns().

* kernfs_create_file_ns_key() is introduced which takes an extra @key
argument. The created file will use the specified key for
active_ref lockdep annotation. If NULL is specified, lockdep for
the file is disabled.

* sysfs_add_file_mode_ns() is updated to use
kernfs_create_file_ns_key() with the appropriate key from the
attribute or NULL if ignore_lockdep is set.

This makes the lockdep annotation properly contained in kernfs while
allowing sysfs to cleanly keep its current behavior. This patch
doesn't introduce any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 024f6471 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: introduce kernfs_notify()

Introduce kernfs interface to wake up poll(2) which takes and returns
sysfs_dirents.

sysfs_notify_dirent() is renamed to kernfs_notify() and sysfs_notify()
is updated so that it doesn't directly grab sysfs_mutex but acquires
the target sysfs_dirents using sysfs_get_dirent().
sysfs_notify_dirent() is reimplemented as a dumb inline wrapper around
kernfs_notify().

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# d19b9846 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: add kernfs_ops->seq_{start|next|stop}()

kernfs_ops currently only supports single_open() behavior which is
pretty restrictive. Add optional callbacks ->seq_{start|next|stop}()
which, when implemented, are invoked for seq_file traversal. This
allows full seq_file functionality for kernfs users. This currently
doesn't have any user and doesn't change any behavior.

v2: Refreshed on top of the updated "sysfs, kernfs: prepare read path
for kernfs".

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 496f7394 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: introduce kernfs_create_file[_ns]()

Introduce kernfs interface to create a file which takes and returns
sysfs_dirents.

The actual file creation part is separated out from
sysfs_add_file_mode_ns() into kernfs_create_file_ns(). The former now
only decides the kernfs_ops to use and the file's size and invokes the
latter.

This patch doesn't introduce behavior changes.

v2: Dummy implementation for !CONFIG_SYSFS updated to return -ENOSYS.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# f6acf8bb 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: introduce kernfs_ops

We're in the process of separating out core sysfs functionality into
kernfs which will deal with sysfs_dirents directly. This patch
introduces kernfs_ops which hosts methods kernfs users implement and
updates fs/sysfs/file.c such that sysfs_kf_*() functions populate
kernfs_ops and kernfs_file_*() functions call the matching entries
from kernfs_ops.

kernfs_ops contains the following groups of methods.

* seq_show() - for kernfs files which use seq_file for reads.

* read() - for direct read implementations. Used iff seq_show() is
not implemented.

* write() - for writes.

* mmap() - for mmaps.

Notes:

* sysfs_elem_attr->ops is added so that kernfs_ops can be accessed
from sysfs_dirent. kernfs_ops() helper is added to verify locking
and access the field.

* SYSFS_FLAG_HAS_(SEQ_SHOW|MMAP) added. sd->s_attr->ops is accessible
only while holding active_ref and there are cases where we want to
take different actions depending on which ops are implemented.
These two flags cache whether the two ops are implemented for those.

* kernfs_file_*() no longer test sysfs type but chooses different
behaviors depending on which methods in kernfs_ops are implemented.
The conversions are trivial except for the open path. As
kernfs_file_open() now decides whether to allow read/write accesses
depending on the kernfs_ops implemented, the presence of methods in
kobjs and attribute_bin should be propagated to kernfs_ops.
sysfs_add_file_mode_ns() is updated so that it propagates presence /
absence of the callbacks through _empty, _ro, _wo, _rw kernfs_ops.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# dd8a5b03 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: move sysfs_open_file to include/linux/kernfs.h

sysfs_open_file will be used as the primary handle for kernfs methods.
Move its definition from fs/sysfs/file.c to include/linux/kernfs.h and
mark the public and private fields.

This is pure relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 93b2b8e4 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: introduce kernfs_create_dir[_ns]()

Introduce kernfs interface to manipulate a directory which takes and
returns sysfs_dirents.

create_dir() is renamed to kernfs_create_dir_ns() and its argumantes
and return value are updated. create_dir() usages are replaced with
kernfs_create_dir_ns() and sysfs_create_subdir() usages are replaced
with kernfs_create_dir(). Dup warnings are handled explicitly by
sysfs users of the kernfs interface.

sysfs_enable_ns() is renamed to kernfs_enable_ns().

This patch doesn't introduce any behavior changes.

v2: Dummy implementation for !CONFIG_SYSFS updated to return -ENOSYS.

v3: kernfs_enable_ns() added.

v4: Refreshed on top of "sysfs: drop kobj_ns_type handling, take #2"
so that this patch removes sysfs_enable_ns().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 5d60418e 23-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: introduce kernfs_setattr()

Introduce kernfs setattr interface - kernfs_setattr().

sysfs_sd_setattr() is renamed to __kernfs_setattr() and
kernfs_setattr() is a simple wrapper around it with sysfs_mutex
locking. sysfs_chmod_file() is updated to get an explicit ref on
kobj->sd and then invoke kernfs_setattr() so that it doesn't have to
use internal interface.

This patch doesn't introduce any behavior differences.

v2: Dummy implementation for !CONFIG_SYSFS updated to return -ENOSYS.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 890ece16 23-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: introduce kernfs_rename[_ns]()

Introduce kernfs rename interface, krenfs_rename[_ns]().

This is just rename of sysfs_rename(). No functional changes.
Function comment is added to kernfs_rename_ns() and @new_parent_sd is
renamed to @new_parent for consistency with other kernfs interfaces.

v2: Dummy implementation for !CONFIG_SYSFS updated to return -ENOSYS.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 5d0e26bb 23-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: introduce kernfs_create_link()

Separate out kernfs symlink interface - kernfs_create_link() - which
takes and returns sysfs_dirents, from sysfs_do_create_link_sd().
sysfs_do_create_link_sd() now just determines the parent and target
sysfs_dirents and invokes the new interface and handles dup warning.

This patch doesn't introduce behavior changes.

v2: Dummy implementation for !CONFIG_SYSFS updated to return -ENOSYS.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 879f40d1 23-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: introduce kernfs_remove[_by_name[_ns]]()

Introduce kernfs removal interfaces - kernfs_remove() and
kernfs_remove_by_name[_ns]().

These are just renames of sysfs_remove() and sysfs_hash_and_remove().
No functional changes.

v2: Dummy kernfs_remove_by_name_ns() for !CONFIG_SYSFS updated to
return -ENOSYS instead of 0.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# b8441ed2 24-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: add skeletons for kernfs

Core sysfs implementation will be separated into kernfs so that it can
be used by other non-kobject users.

This patch creates fs/kernfs/ directory and makes boilerplate changes.
kernfs interface will be directly based on sysfs_dirent and its
forward declaration is moved to include/linux/kernfs.h which is
included from include/linux/sysfs.h. sysfs core implementation will
be gradually separated out and moved to kernfs.

This patch doesn't introduce any functional changes.

v2: mount.c added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>