History log of /linux-master/fs/kernfs/kernfs-internal.h
Revision Date Author Comments
# 4207b556 09-Jan-2024 Tejun Heo <tj@kernel.org>

kernfs: RCU protect kernfs_nodes and avoid kernfs_idr_lock in kernfs_find_and_get_node_by_id()

The BPF helper bpf_cgroup_from_id() calls kernfs_find_and_get_node_by_id()
which acquires kernfs_idr_lock, which is an non-raw non-IRQ-safe lock. This
can lead to deadlocks as bpf_cgroup_from_id() can be called from any BPF
programs including e.g. the ones that attach to functions which are holding
the scheduler rq lock.

Consider the following BPF program:

SEC("fentry/__set_cpus_allowed_ptr_locked")
int BPF_PROG(__set_cpus_allowed_ptr_locked, struct task_struct *p,
struct affinity_context *affn_ctx, struct rq *rq, struct rq_flags *rf)
{
struct cgroup *cgrp = bpf_cgroup_from_id(p->cgroups->dfl_cgrp->kn->id);

if (cgrp) {
bpf_printk("%d[%s] in %s", p->pid, p->comm, cgrp->kn->name);
bpf_cgroup_release(cgrp);
}
return 0;
}

__set_cpus_allowed_ptr_locked() is called with rq lock held and the above
BPF program calls bpf_cgroup_from_id() within leading to the following
lockdep warning:

=====================================================
WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
6.7.0-rc3-work-00053-g07124366a1d7-dirty #147 Not tainted
-----------------------------------------------------
repro/1620 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
ffffffff833b3688 (kernfs_idr_lock){+.+.}-{2:2}, at: kernfs_find_and_get_node_by_id+0x1e/0x70

and this task is already holding:
ffff888237ced698 (&rq->__lock){-.-.}-{2:2}, at: task_rq_lock+0x4e/0xf0
which would create a new lock dependency:
(&rq->__lock){-.-.}-{2:2} -> (kernfs_idr_lock){+.+.}-{2:2}
...
Possible interrupt unsafe locking scenario:

CPU0 CPU1
---- ----
lock(kernfs_idr_lock);
local_irq_disable();
lock(&rq->__lock);
lock(kernfs_idr_lock);
<Interrupt>
lock(&rq->__lock);

*** DEADLOCK ***
...
Call Trace:
dump_stack_lvl+0x55/0x70
dump_stack+0x10/0x20
__lock_acquire+0x781/0x2a40
lock_acquire+0xbf/0x1f0
_raw_spin_lock+0x2f/0x40
kernfs_find_and_get_node_by_id+0x1e/0x70
cgroup_get_from_id+0x21/0x240
bpf_cgroup_from_id+0xe/0x20
bpf_prog_98652316e9337a5a___set_cpus_allowed_ptr_locked+0x96/0x11a
bpf_trampoline_6442545632+0x4f/0x1000
__set_cpus_allowed_ptr_locked+0x5/0x5a0
sched_setaffinity+0x1b3/0x290
__x64_sys_sched_setaffinity+0x4f/0x60
do_syscall_64+0x40/0xe0
entry_SYSCALL_64_after_hwframe+0x46/0x4e

Let's fix it by protecting kernfs_node and kernfs_root with RCU and making
kernfs_find_and_get_node_by_id() acquire rcu_read_lock() instead of
kernfs_idr_lock.

This adds an rcu_head to kernfs_node making it larger by 16 bytes on 64bit.
Combined with the preceding rearrange patch, the net increase is 8 bytes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <andrea.righi@canonical.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/20240109214828.252092-4-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ffb2e065 29-Sep-2023 Wedson Almeida Filho <walmeida@microsoft.com>

kernfs: move kernfs_xattr_handlers to .rodata

This makes it harder for accidental or malicious changes to
kernfs_xattr_handlers at runtime.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Link: https://lore.kernel.org/r/20230930050033.41174-18-wedsonaf@gmail.com
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# c9f2dfb7 09-Mar-2023 Imran Khan <imran.f.khan@oracle.com>

kernfs: Use a per-fs rwsem to protect per-fs list of kernfs_super_info.

Right now per-fs kernfs_rwsem protects list of kernfs_super_info instances
for a kernfs_root. Since kernfs_rwsem is used to synchronize several other
operations across kernfs and since most of these operations don't impact
kernfs_super_info, we can use a separate per-fs rwsem to synchronize access
to list of kernfs_super_info.
This helps in reducing contention around kernfs_rwsem and also allows
operations that change/access list of kernfs_super_info to proceed without
contending for kernfs_rwsem.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Link: https://lore.kernel.org/r/20230309110932.2889010-3-imran.f.khan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 9caf6961 09-Mar-2023 Imran Khan <imran.f.khan@oracle.com>

kernfs: Introduce separate rwsem to protect inode attributes.

Right now a global per-fs rwsem (kernfs_rwsem) synchronizes multiple
kernfs operations. On a large system with few hundred CPUs and few
hundred applications simultaneoulsy trying to access sysfs, this
results in multiple sys_open(s) contending on kernfs_rwsem via
kernfs_iop_permission and kernfs_dop_revalidate.

For example on a system with 384 cores, if I run 200 instances of an
application which is mostly executing the following loop:

for (int loop = 0; loop <100 ; loop++)
{
for (int port_num = 1; port_num < 2; port_num++)
{
for (int gid_index = 0; gid_index < 254; gid_index++ )
{
char ret_buf[64], ret_buf_lo[64];
char gid_file_path[1024];

int ret_len;
int ret_fd;
ssize_t ret_rd;

ub4 i, saved_errno;

memset(ret_buf, 0, sizeof(ret_buf));
memset(gid_file_path, 0, sizeof(gid_file_path));

ret_len = snprintf(gid_file_path, sizeof(gid_file_path),
"/sys/class/infiniband/%s/ports/%d/gids/%d",
dev_name,
port_num,
gid_index);

ret_fd = open(gid_file_path, O_RDONLY | O_CLOEXEC);
if (ret_fd < 0)
{
printf("Failed to open %s\n", gid_file_path);
continue;
}

/* Read the GID */
ret_rd = read(ret_fd, ret_buf, 40);

if (ret_rd == -1)
{
printf("Failed to read from file %s, errno: %u\n",
gid_file_path, saved_errno);

continue;
}

close(ret_fd);
}
}

I see contention around kernfs_rwsem as follows:

path_openat
|
|----link_path_walk.part.0.constprop.0
| |
| |--49.92%--inode_permission
| | |
| | --48.69%--kernfs_iop_permission
| | |
| | |--18.16%--down_read
| | |
| | |--15.38%--up_read
| | |
| | --14.58%--_raw_spin_lock
| | |
| | -----
| |
| |--29.08%--walk_component
| | |
| | --29.02%--lookup_fast
| | |
| | |--24.26%--kernfs_dop_revalidate
| | | |
| | | |--14.97%--down_read
| | | |
| | | --9.01%--up_read
| | |
| | --4.74%--__d_lookup
| | |
| | --4.64%--_raw_spin_lock
| | |
| | ----

Having a separate per-fs rwsem to protect kernfs inode attributes,
will avoid the above mentioned contention and result in better
performance as can bee seen below:

path_openat
|
|----link_path_walk.part.0.constprop.0
| |
| |
| |--27.06%--inode_permission
| | |
| | --25.84%--kernfs_iop_permission
| | |
| | |--9.29%--up_read
| | |
| | |--8.19%--down_read
| | |
| | --7.89%--_raw_spin_lock
| | |
| | ----
| |
| |--22.42%--walk_component
| | |
| | --22.36%--lookup_fast
| | |
| | |--16.07%--__d_lookup
| | | |
| | | --16.01%--_raw_spin_lock
| | | |
| | | ----
| | |
| | --6.28%--kernfs_dop_revalidate
| | |
| | |--3.76%--down_read
| | |
| | --2.26%--up_read

As can be seen from the above data the overhead due to both
kerfs_iop_permission and kernfs_dop_revalidate have gone down and
this also reduces overall run time of the earlier mentioned loop.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Link: https://lore.kernel.org/r/20230309110932.2889010-2-imran.f.khan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 4609e1f1 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->permission() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# b74d24f7 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->getattr() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# c1632a0f 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->setattr() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 24b3e3dd 11-Nov-2022 Randy Dunlap <rdunlap@infradead.org>

kernfs: fix all kernel-doc warnings and multiple typos

Fix kernel-doc warnings. Many of these are about a function's
return value, so use the kernel-doc Return: format to fix those

Use % prefix on numeric constant values.

dir.c: fix typos/spellos
file.c fix typo: s/taret/target/

Fix all of these kernel-doc warnings:

dir.c:305: warning: missing initial short description on line:
* kernfs_name_hash

dir.c:137: warning: No description found for return value of 'kernfs_path_from_node_locked'
dir.c:196: warning: No description found for return value of 'kernfs_name'
dir.c:224: warning: No description found for return value of 'kernfs_path_from_node'
dir.c:292: warning: No description found for return value of 'kernfs_get_parent'
dir.c:312: warning: No description found for return value of 'kernfs_name_hash'
dir.c:404: warning: No description found for return value of 'kernfs_unlink_sibling'
dir.c:588: warning: No description found for return value of 'kernfs_node_from_dentry'
dir.c:806: warning: No description found for return value of 'kernfs_find_ns'
dir.c:879: warning: No description found for return value of 'kernfs_find_and_get_ns'
dir.c:904: warning: No description found for return value of 'kernfs_walk_and_get_ns'
dir.c:927: warning: No description found for return value of 'kernfs_create_root'
dir.c:996: warning: No description found for return value of 'kernfs_root_to_node'
dir.c:1016: warning: No description found for return value of 'kernfs_create_dir_ns'
dir.c:1048: warning: No description found for return value of 'kernfs_create_empty_dir'
dir.c:1306: warning: No description found for return value of 'kernfs_next_descendant_post'
dir.c:1568: warning: No description found for return value of 'kernfs_remove_self'
dir.c:1630: warning: No description found for return value of 'kernfs_remove_by_name_ns'
dir.c:1667: warning: No description found for return value of 'kernfs_rename_ns'

file.c:66: warning: No description found for return value of 'of_on'
file.c:88: warning: No description found for return value of 'kernfs_deref_open_node_locked'
file.c:1036: warning: No description found for return value of '__kernfs_create_file'

inode.c:100: warning: No description found for return value of 'kernfs_setattr'

mount.c:160: warning: No description found for return value of 'kernfs_root_from_sb'
mount.c:198: warning: No description found for return value of 'kernfs_node_dentry'
mount.c:302: warning: No description found for return value of 'kernfs_super_ns'
mount.c:318: warning: No description found for return value of 'kernfs_get_tree'

symlink.c:28: warning: No description found for return value of 'kernfs_create_link'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221112031456.22980-1-rdunlap@infradead.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# bdb2fd7f 27-Aug-2022 Tejun Heo <tj@kernel.org>

kernfs: Skip kernfs_drain_open_files() more aggressively

Track the number of mmapped files and files that need to be released and
skip kernfs_drain_open_file() if both are zero, which are the precise
conditions which require draining open_files. The early exit test is
factored into kernfs_should_drain_open_files() which is now tested by
kernfs_drain_open_files()'s caller - kernfs_drain().

This isn't a meaningful optimization on its own but will enable future
stand-alone kernfs_deactivate() implementation.

v2: Chengming noticed that on->nr_to_release was leaking after ->open()
failure. Fix it by telling kernfs_unlink_open_file() that it's called
from the ->open() fail path and should dec the counter. Use kzalloc() to
allocate kernfs_open_node so that the tracking fields are correctly
initialized.

Cc: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220828050440.734579-5-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 1d25b84e 14-Jun-2022 Imran Khan <imran.f.khan@oracle.com>

kernfs: Replace global kernfs_open_file_mutex with hashed mutexes.

In current kernfs design a single mutex, kernfs_open_file_mutex, protects
the list of kernfs_open_file instances corresponding to a sysfs attribute.
So even if different tasks are opening or closing different sysfs files
they can contend on osq_lock of this mutex. The contention is more apparent
in large scale systems with few hundred CPUs where most of the CPUs have
running tasks that are opening, accessing or closing sysfs files at any
point of time.

Using hashed mutexes in place of a single global mutex, can significantly
reduce contention around global mutex and hence can provide better
scalability. Moreover as these hashed mutexes are not part of kernfs_node
objects we will not see any singnificant change in memory utilization of
kernfs based file systems like sysfs, cgroupfs etc.

Modify interface introduced in previous patch to make use of hashed
mutexes. Use kernfs_node address as hashing key.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Link: https://lore.kernel.org/r/20220615021059.862643-5-imran.f.khan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# f2eb478f 22-Feb-2022 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

kernfs: move struct kernfs_root out of the public view.

There is no need to have struct kernfs_root be part of kernfs.h for
the whole kernel to see and poke around it. Move it internal to kernfs
code and provide a helper function, kernfs_root_to_node(), to handle the
one field that kernfs users were directly accessing from the structure.

Cc: Imran Khan <imran.f.khan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220222070713.3517679-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# f3a69022 17-Feb-2022 Imran Khan <imran.f.khan@oracle.com>

kernfs: remove redundant kernfs_rwsem declaration.

Since 'commit 393c3714081a ("kernfs: switch global kernfs_rwsem lock to
per-fs lock")' per-fs kernfs_rwsem has replaced global kernfs_rwsem.
Remove redundant declaration of global kernfs_rwsem.

Fixes: 393c3714081a ("kernfs: switch global kernfs_rwsem lock to per-fs lock")
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Link: https://lore.kernel.org/r/20220218010205.717582-1-imran.f.khan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 7ba0273b 16-Jul-2021 Ian Kent <raven@themaw.net>

kernfs: switch kernfs to use an rwsem

The kernfs global lock restricts the ability to perform kernfs node
lookup operations in parallel during path walks.

Change the kernfs mutex to an rwsem so that, when opportunity arises,
node searches can be done in parallel with path walk lookups.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/162642770946.63632.2218304587223241374.stgit@web.messagingengine.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 895adbec 16-Jul-2021 Ian Kent <raven@themaw.net>

kernfs: add a revision to identify directory node changes

Add a revision counter to kernfs directory nodes so it can be used
to detect if a directory node has changed during negative dentry
revalidation.

There's an assumption that sizeof(unsigned long) <= sizeof(pointer)
on all architectures and as far as I know that assumption holds.

So adding a revision counter to the struct kernfs_elem_dir variant of
the kernfs_node type union won't increase the size of the kernfs_node
struct. This is because struct kernfs_elem_dir is at least
sizeof(pointer) smaller than the largest union variant. It's tempting
to make the revision counter a u64 but that would increase the size of
kernfs_node on archs where sizeof(pointer) is smaller than the revision
counter.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/162642769895.63632.8356662784964509867.stgit@web.messagingengine.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 549c7297 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: make helpers idmap mount aware

Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 0c47383b 12-Mar-2020 Daniel Xu <dxu@dxuuu.xyz>

kernfs: Add option to enable user xattrs

User extended attributes are useful as metadata storage for kernfs
consumers like cgroups. Especially in the case of cgroups, it is useful
to have a central metadata store that multiple processes/services can
use to coordinate actions.

A concrete example is for userspace out of memory killers. We want to
let delegated cgroup subtree owners (running as non-root) to be able to
say "please avoid killing this cgroup". This is especially important for
desktop linux as delegated subtrees owners are less likely to run as
root.

This patch introduces a new flag, KERNFS_ROOT_SUPPORT_USER_XATTR, that
lets kernfs consumers enable user xattr support. An initial limit of 128
entries or 128KB -- whichever is hit first -- is placed per cgroup
because xattrs come from kernel memory and we don't want to let
unprivileged users accidentally eat up too much kernel memory.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>


# fe0f726c 04-Nov-2019 Tejun Heo <tj@kernel.org>

kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()

kernfs_find_and_get_node_by_ino() looks the kernfs_node matching the
specified ino. On top of that, kernfs_get_node_by_id() and
kernfs_fh_get_inode() implement full ID matching by testing the rest
of ID.

On surface, confusingly, the two are slightly different in that the
latter uses 0 gen as wildcard while the former doesn't - does it mean
that the latter can't uniquely identify inodes w/ 0 gen? In practice,
this is a distinction without a difference because generation number
starts at 1. There are no actual IDs with 0 gen, so it can always
safely used as wildcard.

Let's simplify the code by renaming kernfs_find_and_get_node_by_ino()
to kernfs_find_and_get_node_by_id(), moving all lookup logics into it,
and removing now unnecessary kernfs_get_node_by_id().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 55716d26 01-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428

Based on 1 normalized pattern(s):

this file is released under the gplv2

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 68 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 0ac6075a 22-Feb-2019 Ondrej Mosnacek <omosnace@redhat.com>

kernfs: use simple_xattrs for security attributes

Replace the special handling of security xattrs with simple_xattrs, as
is already done for the trusted xattrs. This simplifies the code and
allows LSMs to use more than just a single xattr to do their business.

Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: manual merge fixes]
Signed-off-by: Paul Moore <paul@paul-moore.com>


# 05895219 22-Feb-2019 Ondrej Mosnacek <omosnace@redhat.com>

kernfs: clean up struct kernfs_iattrs

Right now, kernfs_iattrs embeds the whole struct iattr, even though it
doesn't really use half of its fields... This both leads to wasting
space and makes the code look awkward. Let's just list the few fields
we need directly in struct kernfs_iattrs.

Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: merged a number of chunks manually due to fuzz]
Signed-off-by: Paul Moore <paul@paul-moore.com>


# 23bf1b6b 01-Nov-2018 David Howells <dhowells@redhat.com>

kernfs, sysfs, cgroup, intel_rdt: Support fs_context

Make kernfs support superblock creation/mount/remount with fs_context.

This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.

Notes:

(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).

(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired

(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.

(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.

(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.

(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.

(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.

Weirdies:

(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.

(*) The cgroup refcount web. This really needs documenting.

(*) cgroup2 only has one root?

Add a suggestion from Thomas Gleixner in which the RDT enablement code is
placed into its own function.

[folded a leak fix from Andrey Vagin]

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: cgroups@vger.kernel.org
cc: fenghua.yu@intel.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 26e28d68 05-Feb-2019 Ayush Mittal <ayush.m@samsung.com>

kernfs: Allocating memory for kernfs_iattrs with kmem_cache.

Creating a new cache for kernfs_iattrs.
Currently, memory is allocated with kzalloc() which
always gives aligned memory. On ARM, this is 64 byte aligned.
To avoid the wastage of memory in aligning the size requested,
a new cache for kernfs_iattrs is created.

Size of struct kernfs_iattrs is 80 Bytes.
On ARM, it will come in kmalloc-128 slab.
and it will come in kmalloc-192 slab if debug info is enabled.
Extra bytes taken 48 bytes.

Total number of objects created : 4096
Total saving = 48*4096 = 192 KB

After creating new slab(When debug info is enabled) :
sh-3.2# cat /proc/slabinfo
...
kernfs_iattrs_cache 4069 4096 128 32 1 : tunables 0 0 0 : slabdata 128 128 0
...

All testing has been done on ARM target.

Signed-off-by: Ayush Mittal <ayush.m@samsung.com>
Signed-off-by: Vaneet Narang <v.narang@samsung.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 488dee96 20-Jul-2018 Dmitry Torokhov <dmitry.torokhov@gmail.com>

kernfs: allow creating kernfs objects with arbitrary uid/gid

This change allows creating kernfs files and directories with arbitrary
uid/gid instead of always using GLOBAL_ROOT_UID/GID by extending
kernfs_create_dir_ns() and kernfs_create_file_ns() with uid/gid arguments.
The "simple" kernfs_create_file() and kernfs_create_dir() are left alone
and always create objects belonging to the global root.

When creating symlinks ownership (uid/gid) is taken from the target kernfs
object.

Co-Developed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 319ba91d 12-Jul-2017 Shaohua Li <shli@fb.com>

kernfs: don't set dentry->d_fsdata

When working on adding exportfs operations in kernfs, I found it's hard
to initialize dentry->d_fsdata in the exportfs operations. Looks there
is no way to do it without race condition. Look at the kernfs code
closely, there is no point to set dentry->d_fsdata. inode->i_private
already points to kernfs_node, and we can get inode from a dentry. So
this patch just delete the d_fsdata usage.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ba16b284 12-Jul-2017 Shaohua Li <shli@fb.com>

kernfs: add an API to get kernfs node from inode number

Add an API to get kernfs node from inode number. We will need this to
implement exportfs operations.

This API will be used in blktrace too later, so it should be as fast as
possible. To make the API lock free, kernfs node is freed in RCU
context. And we depend on kernfs_node count/ino number to filter out
stale kernfs nodes.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a528d35e 31-Jan-2017 David Howells <dhowells@redhat.com>

statx: Add a system call to make enhanced file info available

Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.

The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.

Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.

========
OVERVIEW
========

The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.

A number of requests were gathered for features to be included. The
following have been included:

(1) Make the fields a consistent size on all arches and make them large.

(2) Spare space, request flags and information flags are provided for
future expansion.

(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).

(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).

This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].

(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).

(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).

And the following have been left out for future extension:

(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].

Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.

(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).

(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].

(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].

(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).

(10) Extra coherency data may be useful in making backups [Andreas Dilger].

(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).

(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...

(This requires a separate system call - I have an fsinfo() call idea
for this).

(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].

(Deferred to fsinfo).

(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].

(Deferred to fsinfo).

(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).

(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were to ext[234]-specific and shouldn't
be exposed through statx this way).

(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].

(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabal might require extra filesystem operations).

(16) Femtosecond-resolution timestamps [Dave Chinner].

(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).

(17) A set multiple attributes syscall to go with this.

===============
NEW SYSTEM CALL
===============

The new system call is:

int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);

The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.

Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):

(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.

(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.

(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.

mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.

buffer points to the destination for the data. This must be 256 bytes in
size.

======================
MAIN ATTRIBUTES RECORD
======================

The following structures are defined in which to return the main attribute
set:

struct statx_timestamp {
__s64 tv_sec;
__s32 tv_nsec;
__s32 __reserved;
};

struct statx {
__u32 stx_mask;
__u32 stx_blksize;
__u64 stx_attributes;
__u32 stx_nlink;
__u32 stx_uid;
__u32 stx_gid;
__u16 stx_mode;
__u16 __spare0[1];
__u64 stx_ino;
__u64 stx_size;
__u64 stx_blocks;
__u64 __spare1[1];
struct statx_timestamp stx_atime;
struct statx_timestamp stx_btime;
struct statx_timestamp stx_ctime;
struct statx_timestamp stx_mtime;
__u32 stx_rdev_major;
__u32 stx_rdev_minor;
__u32 stx_dev_major;
__u32 stx_dev_minor;
__u64 __spare2[14];
};

The defined bits in request_mask and stx_mask are:

STATX_TYPE Want/got stx_mode & S_IFMT
STATX_MODE Want/got stx_mode & ~S_IFMT
STATX_NLINK Want/got stx_nlink
STATX_UID Want/got stx_uid
STATX_GID Want/got stx_gid
STATX_ATIME Want/got stx_atime{,_ns}
STATX_MTIME Want/got stx_mtime{,_ns}
STATX_CTIME Want/got stx_ctime{,_ns}
STATX_INO Want/got stx_ino
STATX_SIZE Want/got stx_size
STATX_BLOCKS Want/got stx_blocks
STATX_BASIC_STATS [The stuff in the normal stat struct]
STATX_BTIME Want/got stx_btime{,_ns}
STATX_ALL [All currently available stuff]

stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.

Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.

The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:

STATX_ATTR_COMPRESSED File is compressed by the fs
STATX_ATTR_IMMUTABLE File is marked immutable
STATX_ATTR_APPEND File is append-only
STATX_ATTR_NODUMP File is not to be dumped
STATX_ATTR_ENCRYPTED File requires key to decrypt in fs

Within the kernel, the supported flags are listed by:

KSTAT_ATTR_FS_IOC_FLAGS

[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]

New flags include:

STATX_ATTR_AUTOMOUNT Object is an automount trigger

These are for the use of GUI tools that might want to mark files specially,
depending on what they are.

Fields in struct statx come in a number of classes:

(0) stx_dev_*, stx_blksize.

These are local system information and are always available.

(1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.

These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.

If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.

If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.

Note that there are instances where the type might not be valid, for
instance Windows reparse points.

(2) stx_rdev_*.

This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.

(3) stx_btime.

Similar to (1), except this will be set to 0 if it doesn't exist.

=======
TESTING
=======

The following test program can be used to test the statx system call:

samples/statx/test-statx.c

Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.

Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.

[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

Secondly, the result of automounting on that directory.

[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0e67db2f 27-Dec-2016 Tejun Heo <tj@kernel.org>

kernfs: add kernfs_ops->open/release() callbacks

Add ->open/release() methods to kernfs_ops. ->open() is called when
the file is opened and ->release() when the file is either released or
severed. These callbacks can be used, for example, to manage
persistent caching objects over multiple seq_file iterations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>


# e72a1a8b 29-Sep-2016 Andreas Gruenbacher <agruenba@redhat.com>

kernfs: Switch to generic xattr handlers

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 3767e255 27-May-2016 Al Viro <viro@zeniv.linux.org.uk>

switch ->setxattr() to passing dentry and inode separately

smack ->d_instantiate() uses ->setxattr(), so to be able to call it before
we'd hashed the new dentry and attached it to inode, we need ->setxattr()
instances getting the inode as an explicit argument rather than obtaining
it from dentry.

Similar change for ->getxattr() had been done in commit ce23e64. Unlike
->getxattr() (which is used by both selinux and smack instances of
->d_instantiate()) ->setxattr() is used only by smack one and unfortunately
it got missed back then.

Reported-by: Seung-Woo Kim <sw0312.kim@samsung.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ce23e640 10-Apr-2016 Al Viro <viro@zeniv.linux.org.uk>

->getxattr(): pass dentry and inode as separate arguments

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# fb02915f 18-Jun-2015 Tejun Heo <tj@kernel.org>

kernfs: make kernfs_get_inode() public

Move kernfs_get_inode() prototype from fs/kernfs/kernfs-internal.h to
include/linux/kernfs.h. It obtains the matching inode for a
kernfs_node.

It will be used by cgroup for inode based permission checks for now
but is generally useful.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# a7a2c680 14-Jan-2015 Christoph Hellwig <hch@lst.de>

fs: deduplicate noop_backing_dev_info

hugetlbfs, kernfs and dlmfs can simply use noop_backing_dev_info instead
of creating a local duplicate.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 7d568a83 09-Apr-2014 Tejun Heo <tj@kernel.org>

kernfs: implement kernfs_root->supers list

Currently, there's no way to find out which super_blocks are
associated with a given kernfs_root. Let's implement it - the planned
inotify extension to kernfs_notify() needs it.

Make kernfs_super_info point back to the super_block and chain it at
kernfs_root->supers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# f41c5934 14-Feb-2014 Li Zefan <lizefan@huawei.com>

kernfs: fix kernfs_node_from_dentry()

Currently kernfs_node_from_dentry() returns NULL for root dentry,
because root_dentry->d_op == NULL.

Due to this bug cgroupstats_build() returns -EINVAL for root cgroup.

# mount -t cgroup -o cpuacct /cgroup
# Documentation/accounting/getdelays -C /cgroup
fatal reply error, errno -22

With this fix:

# Documentation/accounting/getdelays -C /cgroup
sleeping 305, blocked 0, running 1, stopped 0, uninterruptible 1

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 81c173cb 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: remove KERNFS_REMOVED

KERNFS_REMOVED is used to mark half-initialized and dying nodes so
that they don't show up in lookups and deny adding new nodes under or
renaming it; however, its role overlaps that of deactivation.

It's necessary to deny addition of new children while removal is in
progress; however, this role considerably intersects with deactivation
- KERNFS_REMOVED prevents new children while deactivation prevents new
file operations. There's no reason to have them separate making
things more complex than necessary.

This patch removes KERNFS_REMOVED.

* Instead of KERNFS_REMOVED, each node now starts its life
deactivated. This means that we now use both atomic_add() and
atomic_sub() on KN_DEACTIVATED_BIAS, which is INT_MIN. The compiler
generates an overflow warnings when negating INT_MIN as the negation
can't be represented as a positive number. Nothing is actually
broken but let's bump BIAS by one to avoid the warnings for archs
which negates the subtrahend..

* A new helper kernfs_active() which tests whether kn->active >= 0 is
added for convenience and lockdep annotation. All KERNFS_REMOVED
tests are replaced with negated kernfs_active() tests.

* __kernfs_remove() is updated to deactivate, but not drain, all nodes
in the subtree instead of setting KERNFS_REMOVED. This removes
deactivation from kernfs_deactivate(), which is now renamed to
kernfs_drain().

* Sanity check on KERNFS_REMOVED in kernfs_put() is replaced with
checks on the active ref.

* Some comment style updates in the affected area.

v2: Reordered before removal path restructuring. kernfs_active()
dropped and kernfs_get/put_active() used instead. RB_EMPTY_NODE()
used in the lookup paths.

v3: Reverted most of v2 except for creating a new node with
KN_DEACTIVATED_BIAS.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 988cd7af 03-Feb-2014 Tejun Heo <tj@kernel.org>

kernfs: remove kernfs_addrm_cxt

kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were
added because there were operations which should be performed outside
kernfs_mutex after adding and removing kernfs_nodes. The necessary
operations were recorded in kernfs_addrm_cxt and performed by
kernfs_addrm_finish(); however, after the recent changes which
relocated deactivation and unmapping so that they're performed
directly during removal, the only operation kernfs_addrm_finish()
performs is kernfs_put(), which can be moved inside the removal path
too.

This patch moves the kernfs_put() of the base ref to __kernfs_remove()
and remove kernfs_addrm_cxt and kernfs_addrm_start/finish().

* kernfs_add_one() is updated to grab and release kernfs_mutex itself.
sysfs_addrm_start/finish() invocations around it are removed from
all users.

* __kernfs_remove() puts an unlinked node directly instead of chaining
it to kernfs_addrm_cxt. Its callers are updated to grab and release
kernfs_mutex instead of calling kernfs_addrm_start/finish() around
it.

v2: Rebased on top of "kernfs: associate a new kernfs_node with its
parent on creation" which dropped @parent from kernfs_add_one().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# db4aad20 17-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: associate a new kernfs_node with its parent on creation

Once created, a kernfs_node is always destroyed by kernfs_put().
Since ba7443bc656e ("sysfs, kernfs: implement
kernfs_create/destroy_root()"), kernfs_put() depends on kernfs_root()
to locate the ino_ida. kernfs_root() in turn depends on
kernfs_node->parent being set for !dir nodes. This means that
kernfs_put() of a !dir node requires its ->parent to be initialized.

This leads to oops when a newly created !dir node is destroyed without
going through kernfs_add_one() or after failing kernfs_add_one()
before ->parent is set. kernfs_root() invoked from kernfs_put() will
try to dereference NULL parent.

Fix it by moving parent association to kernfs_new_node() from
kernfs_add_one(). kernfs_new_node() now takes @parent instead of
@root and determines the root from the parent and also sets the new
node's parent properly. @parent parameter is removed from
kernfs_add_one(). As there's no parent when creating the root node,
__kernfs_new_node() which takes @root as before and doesn't set the
parent is used in that case.

This ensures that a kernfs_node in any stage in its life has its
parent associated and thus can be put.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 798c75a0 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: remove KERNFS_REMOVED"

This reverts commit ae34372eb8408b3d07e870f1939f99007a730d28.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 55f6e30d 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()"

This reverts commit f601f9a2bf7dc1f7ee18feece4c4e2fc6845d6c4.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 7653fe9d 13-Jan-2014 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revert "kernfs: remove kernfs_addrm_cxt"

This reverts commit 99177a34110889a8f2c36420c34e3bcc9bfd8a70.

Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 99177a34 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: remove kernfs_addrm_cxt

kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were
added because there were operations which should be performed outside
kernfs_mutex after adding and removing kernfs_nodes. The necessary
operations were recorded in kernfs_addrm_cxt and performed by
kernfs_addrm_finish(); however, after the recent changes which
relocated deactivation and unmapping so that they're performed
directly during removal, the only operation kernfs_addrm_finish()
performs is kernfs_put(), which can be moved inside the removal path
too.

This patch moves the kernfs_put() of the base ref to __kernfs_remove()
and remove kernfs_addrm_cxt and kernfs_addrm_start/finish().

* kernfs_add_one() is updated to grab and release the parent's active
ref and kernfs_mutex itself. kernfs_get/put_active() and
kernfs_addrm_start/finish() invocations around it are removed from
all users.

* __kernfs_remove() puts an unlinked node directly instead of chaining
it to kernfs_addrm_cxt. Its callers are updated to grab and release
kernfs_mutex instead of calling kernfs_addrm_start/finish() around
it.

v2: Updated to fit the v2 restructuring of removal path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# f601f9a2 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()

kernfs_unmap_bin_file() is supposed to unmap all memory mappings of
the target file before kernfs_remove() finishes; however, it currently
is being called from kernfs_addrm_finish() and has the same race
problem as the original implementation of deactivation when there are
multiple removers - only the remover which snatches the node to its
addrm_cxt->removed list is guaranteed to wait for its completion
before returning.

It can be fixed by moving kernfs_unmap_bin_file() invocation from
kernfs_addrm_finish() to __kernfs_remove(). The function may be
called multiple times but that shouldn't do any harm.

We end up dropping kernfs_mutex in the removal loop and the node may
be removed inbetween by someone else. kernfs_unlink_sibling() is
updated to test whether the node has already been removed and return
accordingly. __kernfs_remove() in turn performs post-unlinking
cleanup only if it actually unlinked the node.

KERNFS_HAS_MMAP test is moved out of the unmap function into
__kernfs_remove() so that we don't unlock kernfs_mutex unnecessarily.
While at it, drop the now meaningless "bin" qualifier from the
function name.

v2: Rewritten to fit the v2 restructuring of removal path. HAS_MMAP
test relocated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ae34372e 10-Jan-2014 Tejun Heo <tj@kernel.org>

kernfs: remove KERNFS_REMOVED

KERNFS_REMOVED is used to mark half-initialized and dying nodes so
that they don't show up in lookups and deny adding new nodes under or
renaming it; however, its role overlaps those of deactivation and
removal from rbtree.

It's necessary to deny addition of new children while removal is in
progress; however, this role considerably intersects with deactivation
- KERNFS_REMOVED prevents new children while deactivation prevents new
file operations. There's no reason to have them separate making
things more complex than necessary.

KERNFS_REMOVED is also used to decide whether a node is still visible
to vfs layer, which is rather redundant as equivalent determination
can be made by testing whether the node is on its parent's children
rbtree or not.

This patch removes KERNFS_REMOVED.

* Instead of KERNFS_REMOVED, each node now starts its life
deactivated. This means that we now use both atomic_add() and
atomic_sub() on KN_DEACTIVATED_BIAS, which is INT_MIN. The compiler
generates an overflow warnings when negating INT_MIN as the negation
can't be represented as a positive number. Nothing is actually
broken but let's bump BIAS by one to avoid the warnings for archs
which negates the subtrahend..

* KERNFS_REMOVED tests in add and rename paths are replaced with
kernfs_get/put_active() of the target nodes. Due to the way the add
path is structured now, active ref handling is done in the callers
of kernfs_add_one(). This will be consolidated up later.

* kernfs_remove_one() is updated to deactivate instead of setting
KERNFS_REMOVED. This removes deactivation from kernfs_deactivate(),
which is now renamed to kernfs_drain().

* kernfs_dop_revalidate() now tests RB_EMPTY_NODE(&kn->rb) instead of
KERNFS_REMOVED and KERNFS_REMOVED test in kernfs_dir_pos() is
dropped. A node which is removed from the children rbtree is not
included in the iteration in the first place. This means that a
node may be visible through vfs a bit longer - it's now also visible
after deactivation until the actual removal. This slightly enlarged
window difference doesn't make any difference to the userland.

* Sanity check on KERNFS_REMOVED in kernfs_put() is replaced with
checks on the active ref.

* Some comment style updates in the affected area.

v2: Reordered before removal path restructuring. kernfs_active()
dropped and kernfs_get/put_active() used instead. RB_EMPTY_NODE()
used in the lookup paths.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 2063d608 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: mark static names with KERNFS_STATIC_NAME

Because sysfs used struct attribute which are supposed to stay
constant, sysfs didn't copy names when creating regular files. The
specified string for name was supposed to stay constant. Such
distinction isn't inherent for kernfs. kernfs_create_file[_ns]()
should be able to take the same @name as kernfs_create_dir[_ns]()

As there can be huge number of sysfs attributes, we still want to be
able to use static names for sysfs attributes. This patch renames
kernfs_create_file_ns_key() to __kernfs_create_file() and adds
@name_is_static parameter so that the caller can explicitly indicate
that @name can be used without copying. kernfs is updated to use
KERNFS_STATIC_NAME to distinguish static and copied names.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# c637b8ac 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: s/sysfs/kernfs/ in internal functions and whatever is left

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

This patch performs the following renames.

* s/sysfs_*()/kernfs_*()/ in all internal functions
* s/sysfs/kernfs/ in internal strings, comments and whatever is remaining
* Uniformly rename various vfs operations so that they're consistently
named and distinguishable.

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# a797bfc3 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: s/sysfs/kernfs/ in global variables

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

This patch performs the following renames.

* s/sysfs_mutex/kernfs_mutex/
* s/sysfs_dentry_ops/kernfs_dops/
* s/sysfs_dir_operations/kernfs_dir_fops/
* s/sysfs_dir_inode_operations/kernfs_dir_iops/
* s/kernfs_file_operations/kernfs_file_fops/ - renamed for consistency
* s/sysfs_symlink_inode_operations/kernfs_symlink_iops/
* s/sysfs_aops/kernfs_aops/
* s/sysfs_backing_dev_info/kernfs_bdi/
* s/sysfs_inode_operations/kernfs_iops/
* s/sysfs_dir_cachep/kernfs_node_cache/
* s/sysfs_ops/kernfs_sops/

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# df23fc39 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: s/sysfs/kernfs/ in constants

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

This patch performs the following renames.

* s/SYSFS_DIR/KERNFS_DIR/
* s/SYSFS_KOBJ_ATTR/KERNFS_FILE/
* s/SYSFS_KOBJ_LINK/KERNFS_LINK/
* s/SYSFS_{TYPE_FLAGS}/KERNFS_{TYPE_FLAGS}/
* s/SYSFS_FLAG_{FLAG}/KERNFS_{FLAG}/
* s/sysfs_type()/kernfs_type()/
* s/SD_DEACTIVATED_BIAS/KN_DEACTIVATED_BIAS/

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# c525aadd 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: s/sysfs/kernfs/ in various data structures

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

This patch performs the following renames.

* s/sysfs_open_dirent/kernfs_open_node/
* s/sysfs_open_file/kernfs_open_file/
* s/sysfs_inode_attrs/kernfs_iattrs/
* s/sysfs_addrm_cxt/kernfs_addrm_cxt/
* s/sysfs_super_info/kernfs_super_info/
* s/sysfs_info()/kernfs_info()/
* s/sysfs_open_dirent_lock/kernfs_open_node_lock/
* s/sysfs_open_file_mutex/kernfs_open_file_mutex/
* s/sysfs_of()/kernfs_of()/

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# adc5e8b5 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: drop s_ prefix from kernfs_node members

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

s_ prefix for kernfs members is used inconsistently and a misnomer
now. It's not like kernfs_node is used widely across the kernel
making the ability to grep for the members particularly useful. Let's
just drop the prefix.

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 324a56e1 11-Dec-2013 Tejun Heo <tj@kernel.org>

kernfs: s/sysfs_dirent/kernfs_node/ and rename its friends accordingly

kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.

This patch performs the following renames.

* s/sysfs_elem_dir/kernfs_elem_dir/
* s/sysfs_elem_symlink/kernfs_elem_symlink/
* s/sysfs_elem_attr/kernfs_elem_file/
* s/sysfs_dirent/kernfs_node/
* s/sd/kn/ in kernfs proper
* s/parent_sd/parent/
* s/target_sd/target/
* s/dir_sd/parent/
* s/to_sysfs_dirent()/rb_to_kn()/
* misc renames of local vars when they conflict with the above

Because md, mic and gpio dig into sysfs details, this patch ends up
modifying them. All are sysfs_dirent renames and trivial. While we
can avoid these by introducing a dummy wrapping struct sysfs_dirent
around kernfs_node, given the limited usage outside kernfs and sysfs
proper, I don't think such workaround is called for.

This patch is strictly rename only and doesn't introduce any
functional difference.

- mic / gpio renames were missing. Spotted by kbuild test robot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 2322392b 23-Nov-2013 Tejun Heo <tj@kernel.org>

kernfs: implement "trusted.*" xattr support

kernfs inherited "security.*" xattr support from sysfs. This patch
extends xattr support to "trusted.*" using simple_xattr_*(). As
trusted xattrs are restricted to CAP_SYS_ADMIN, simple_xattr_*() which
uses kernel memory for storage shouldn't be problematic.

Note that the existing "security.*" support doesn't implement
get/remove/list and the this patch only implements those ops for
"trusted.*". We probably want to extend those ops to include support
for "security.*".

This patch will allow using kernfs from cgroup which requires
"trusted.*" xattr support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David P. Quigley <dpquigl@tycho.nsa.gov>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 9a8049af 23-Nov-2013 Tejun Heo <tj@kernel.org>

kernfs: update sysfs_init_inode_attrs()

sysfs_init_inode_attrs() is a bit clumsy to use requiring the caller
to check whether @sd->s_iattr is already set or not. Rename it to
sysfs_inode_attrs(), update it to check whether @sd->s_iattr is
already initialized before trying to initialize it and return
@sd->s_iattr. This simplifies the callers.

While at it,

* Rename struct sysfs_inode_attrs pointer variables to "attrs". As
kernfs no longer deals with "struct attribute", this isn't confusing
and makes it easier to distinguish from struct iattr pointers.

* A new field will be added to sysfs_inode_attrs. Reindent in
preparation.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# bfc5c173 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: remove cross inclusions of internal headers

fs/kernfs/kernfs-internal.h needed to include fs/sysfs/sysfs.h because
part of kernfs core implementation was living in sysfs.

fs/sysfs/sysfs.h needed to include fs/kernfs/kernfs-internal.h because
include/linux/kernfs.h didn't expose enough interface.

The separation is complete and neither is true anymore. Remove the
cross inclusion and make sysfs a proper user of kernfs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# cf9e5a73 29-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: make sysfs_dirent definition public

sysfs_dirent includes some information which should be available to
kernfs users - the type, flags, name and parent pointer. This patch
moves sysfs_dirent definition from kernfs/kernfs-internal.h to
include/linux/kernfs.h so that kernfs users can access them.

The type part of flags is exported as enum kernfs_node_type, the flags
kernfs_node_flag, sysfs_type() and kernfs_enable_ns() are moved to
include/linux/kernfs.h and the former is updated to return the enum
type. sysfs_dirent->s_parent and ->s_name are marked explicitly as
public.

This patch doesn't introduce any functional changes.

v2: Flags exported too and kernfs_enable_ns() definition moved.

v3: While moving kernfs_enable_ns() to include/linux/kernfs.h, v1 and
v2 put the definition outside CONFIG_SYSFS replacing the dummy
implementation with the actual implementation too. Unfortunately,
this can lead to oops when !CONFIG_SYSFS because
kernfs_enable_ns() may be called on a NULL @sd and now tries to
dereference @sd instead of not doing anything. This issue was
reported by Yuanhan Liu.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# fa736a95 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: move mount core code to fs/kernfs/mount.c

Move core mount code to fs/kernfs/mount.c. The respective
declarations in fs/sysfs/sysfs.h are moved to
fs/kernfs/kernfs-internal.h.

This is pure relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 4b93dc9b 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: prepare mount path for kernfs

We're in the process of separating out core sysfs functionality into
kernfs which will deal with sysfs_dirents directly. This patch
rearranges mount path so that the kernfs and sysfs parts are separate.

* As sysfs_super_info won't be visible outside kernfs proper,
kernfs_super_ns() is added to allow kernfs users to access a
super_block's namespace tag.

* Generic mount operation is separated out into kernfs_mount_ns().
sysfs_mount() now just performs sysfs-specific permission check,
acquires namespace tag, and invokes kernfs_mount_ns().

* Generic superblock release is separated out into kernfs_kill_sb()
which can be used directly as file_system_type->kill_sb(). As sysfs
needs to put the namespace tag, sysfs_kill_sb() wraps
kernfs_kill_sb() with ns tag put.

* sysfs_dir_cachep init and sysfs_inode_init() are separated out into
kernfs_init(). kernfs_init() uses only small amount of memory and
trying to handle and propagate kernfs_init() failure doesn't make
much sense. Use SLAB_PANIC for sysfs_dir_cachep and make
sysfs_inode_init() panic on failure.

After this change, kernfs_init() should be called before
sysfs_init(), fs/namespace.c::mnt_init() modified accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# bc755553 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: make inode number ida per kernfs_root

kernfs is being updated to allow multiple sysfs_dirent hierarchies so
that it can also be used by other users. Currently, inode number is
allocated using a global ida, sysfs_ino_ida; however, inos for
different hierarchies should be handled separately.

This patch makes ino allocation per kernfs_root. sysfs_ino_ida is
replaced by kernfs_root->ino_ida and sysfs_new_dirent() is updated to
take @root and allocate ino from it. ida_simple_get/remove() are used
instead of sysfs_ino_lock and sysfs_alloc/free_ino().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ba7443bc 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: implement kernfs_create/destroy_root()

There currently is single kernfs hierarchy in the whole system which
is used for sysfs. kernfs needs to support multiple hierarchies to
allow other users. This patch introduces struct kernfs_root which
serves as the root of each kernfs hierarchy and implements
kernfs_create/destroy_root().

* Each kernfs_root is associated with a root sd (sysfs_dentry). The
root is freed when the root sd is released and kernfs_destory_root()
simply invokes kernfs_remove() on the root sd. sysfs_remove_one()
is updated to handle release of the root sd. Note that ps_iattr
update in sysfs_remove_one() is trivially updated for readability.

* Root sd's are now dynamically allocated using sysfs_new_dirent().
Update sysfs_alloc_ino() so that it gives out ino from 1 so that the
root sd still gets ino 1.

* While kernfs currently only points to the root sd, it'll soon grow
fields which are specific to each hierarchy. As determining a given
sd's root will be necessary, sd->s_dir.root is added. This backlink
fits better as a separate field in sd; however, sd->s_dir is inside
union with space to spare, so use it to save space and provide
kernfs_root() accessor to determine the root sd.

* As hierarchies may be destroyed now, each mount needs to hold onto
the hierarchy it's attached to. Update sysfs_fill_super() and
sysfs_kill_sb() so that they get and put the kernfs_root
respectively.

* sysfs_root is replaced with kernfs_root which is dynamically created
by invoking kernfs_create_root() from sysfs_init().

This patch doesn't introduce any visible behavior changes.

v2: kernfs_create_root() forgot to set @sd->priv. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 2072f1af 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: move symlink core code to fs/kernfs/symlink.c

Move core symlink code to fs/kernfs/symlink.c. fs/sysfs/symlink.c now
only contains sysfs wrappers around kernfs interfaces. The respective
declarations in fs/sysfs/sysfs.h are moved to
fs/kernfs/kernfs-internal.h.

This is pure relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 414985ae 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: move file core code to fs/kernfs/file.c

Move core file code to fs/kernfs/file.c. fs/sysfs/file.c now contains
sysfs kernfs_ops callbacks, sysfs wrappers around kernfs interfaces,
and sysfs_schedule_callback(). The respective declarations in
fs/sysfs/sysfs.h are moved to fs/kernfs/kernfs-internal.h.

This is pure relocation.

v2: Refreshed on top of the v2 of "sysfs, kernfs: prepare read path
for kernfs".

v3: Refreshed on top of the v3 of "sysfs, kernfs: prepare read path
for kernfs".

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# fd7b9f7b 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: move dir core code to fs/kernfs/dir.c

Move core dir code to fs/kernfs/dir.c. fs/sysfs/dir.c now only
contains sysfs_warn_dup() and sysfs wrappers around kernfs interfaces.
The respective declarations in fs/sysfs/sysfs.h are moved to
fs/kernfs/kernfs-internal.h.

This is pure relocation.

v2: sysfs_symlink_target_lock was mistakenly relocated to kernfs. It
should remain with sysfs. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ffed24e2 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: move inode code to fs/kernfs/inode.c

There's nothing sysfs-specific in fs/sysfs/inode.c. Move everything
in it to fs/kernfs/inode.c. The respective declarations in
fs/sysfs/sysfs.h are moved to fs/kernfs/kernfs-internal.h.

This is pure relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# ae6621b0 28-Nov-2013 Tejun Heo <tj@kernel.org>

sysfs, kernfs: move internal decls to fs/kernfs/kernfs-internal.h

Move data structure, constant and basic accessor declarations from
fs/sysfs/sysfs.h to fs/kernfs/kernfs-internal.h. The two files
currently include each other. Once kernfs / sysfs separation is
complete, the cross inclusions will be removed. Inclusion protectors
are added to fs/sysfs/sysfs.h to allow cross-inclusion.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>