Searched +hist:3 +hist:eef34ad (Results 1 - 3 of 3) sorted by relevance

/linux-master/fs/sysfs/
H A Ddir.cdiff 3abb1d90 Wed Aug 10 09:23:44 MDT 2016 Tejun Heo <tj@kernel.org> kernfs: make kernfs_path*() behave in the style of strlcpy()

kernfs_path*() functions always return the length of the full path but
the path content is undefined if the length is larger than the
provided buffer. This makes its behavior different from strlcpy() and
requires error handling in all its users even when they don't care
about truncation. In addition, the implementation can actully be
simplified by making it behave properly in strlcpy() style.

* Update kernfs_path_from_node_locked() to always fill up the buffer
with path. If the buffer is not large enough, the output is
truncated and terminated.

* kernfs_path() no longer needs error handling. Make it a simple
inline wrapper around kernfs_path_from_node().

* sysfs_warn_dup()'s use of kernfs_path() doesn't need error handling.
Updated accordingly.

* cgroup_path()'s use of kernfs_path() updated to retain the old
behavior.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
diff 3eef34ad Fri Feb 07 11:32:07 MST 2014 Tejun Heo <tj@kernel.org> kernfs: implement kernfs_get_parent(), kernfs_name/path() and friends

kernfs_node->parent and ->name are currently marked as "published"
indicating that kernfs users may access them directly; however, those
fields may get updated by kernfs_rename[_ns]() and unrestricted access
may lead to erroneous values or oops.

Protect ->parent and ->name updates with a irq-safe spinlock
kernfs_rename_lock and implement the following accessors for these
fields.

* kernfs_name() - format the node's name into the specified buffer
* kernfs_path() - format the node's path into the specified buffer
* pr_cont_kernfs_name() - pr_cont a node's name (doesn't need buffer)
* pr_cont_kernfs_path() - pr_cont a node's path (doesn't need buffer)
* kernfs_get_parent() - pin and return a node's parent

All can be called under any context. The recursive sysfs_pathname()
in fs/sysfs/dir.c is replaced with kernfs_path() and
sysfs_rename_dir_ns() is updated to use kernfs_get_parent() instead of
dereferencing parent directly.

v2: Dummy definition of kernfs_path() for !CONFIG_KERNFS was missing
static inline making it cause a lot of build warnings. Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 3eef34ad Fri Feb 07 11:32:07 MST 2014 Tejun Heo <tj@kernel.org> kernfs: implement kernfs_get_parent(), kernfs_name/path() and friends

kernfs_node->parent and ->name are currently marked as "published"
indicating that kernfs users may access them directly; however, those
fields may get updated by kernfs_rename[_ns]() and unrestricted access
may lead to erroneous values or oops.

Protect ->parent and ->name updates with a irq-safe spinlock
kernfs_rename_lock and implement the following accessors for these
fields.

* kernfs_name() - format the node's name into the specified buffer
* kernfs_path() - format the node's path into the specified buffer
* pr_cont_kernfs_name() - pr_cont a node's name (doesn't need buffer)
* pr_cont_kernfs_path() - pr_cont a node's path (doesn't need buffer)
* kernfs_get_parent() - pin and return a node's parent

All can be called under any context. The recursive sysfs_pathname()
in fs/sysfs/dir.c is replaced with kernfs_path() and
sysfs_rename_dir_ns() is updated to use kernfs_get_parent() instead of
dereferencing parent directly.

v2: Dummy definition of kernfs_path() for !CONFIG_KERNFS was missing
static inline making it cause a lot of build warnings. Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 785a162d Mon Oct 14 07:27:11 MDT 2013 Tejun Heo <tj@kernel.org> sysfs: make sysfs_file_ops() follow ignore_lockdep flag

375b611e60 ("sysfs: remove sysfs_buffer->ops") introduced
sysfs_file_ops() which determines the associated file operation of a
given sysfs_dirent. As file ops access should be protected by an
active reference, the new function includes a lockdep assertion on the
sysfs_dirent; unfortunately, I forgot to take attr->ignore_lockdep
flag into account and the lockdep assertion trips spuriously for files
which opt out from active reference lockdep checking.

# cat /sys/devices/pci0000:00/0000:00:01.2/usb1/authorized

------------[ cut here ]------------
WARNING: CPU: 1 PID: 540 at /work/os/work/fs/sysfs/file.c:79 sysfs_file_ops+0x4e/0x60()
Modules linked in:
CPU: 1 PID: 540 Comm: cat Not tainted 3.11.0-work+ #3
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
0000000000000009 ffff880016205c08 ffffffff81ca0131 0000000000000000
ffff880016205c40 ffffffff81096d0d ffff8800166cb898 ffff8800166f6f60
ffffffff8125a220 ffff880011ab1ec0 ffff88000aff0c78 ffff880016205c50
Call Trace:
[<ffffffff81ca0131>] dump_stack+0x4e/0x82
[<ffffffff81096d0d>] warn_slowpath_common+0x7d/0xa0
[<ffffffff81096dea>] warn_slowpath_null+0x1a/0x20
[<ffffffff8125994e>] sysfs_file_ops+0x4e/0x60
[<ffffffff8125a274>] sysfs_open_file+0x54/0x300
[<ffffffff811df612>] do_dentry_open.isra.17+0x182/0x280
[<ffffffff811df820>] finish_open+0x30/0x40
[<ffffffff811f0623>] do_last+0x503/0xd90
[<ffffffff811f0f6b>] path_openat+0xbb/0x6d0
[<ffffffff811f23ba>] do_filp_open+0x3a/0x90
[<ffffffff811e09a9>] do_sys_open+0x129/0x220
[<ffffffff811e0abe>] SyS_open+0x1e/0x20
[<ffffffff81caf3c2>] system_call_fastpath+0x16/0x1b
---[ end trace aa48096b111dafdb ]---

Rename fs/sysfs/dir.c::ignore_lockdep() to sysfs_ignore_lockdep() and
move it to fs/sysfs/sysfs.h and make sysfs_file_ops() skip lockdep
assertion if sysfs_ignore_lockdep() is true.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 3db3c625 Fri Mar 08 08:07:27 MST 2013 Maarten Lankhorst <m.b.lankhorst@gmail.com> sysfs: use atomic_inc_unless_negative in sysfs_get_active

It seems that sysfs has an interesting way of doing the same thing.
This removes the cpu_relax unfortunately, but if it's really needed,
it would be better to add this to include/linux/atomic.h to benefit
all atomic ops users.

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 3a198886 Fri Apr 06 14:41:06 MDT 2012 Dan Williams <dan.j.williams@intel.com> sysfs: handle 'parent deleted before child added'

In scsi at least two cases of the parent device being deleted before the
child is added have been observed.

1/ scsi is performing async scans and the device is removed prior to the
async can thread running (can happen with an in-opportune / unlikely
unplug during initial scan).

2/ libsas discovery event running after the parent port has been torn
down (this is a bug in libsas).

Result in crash signatures like:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
IP: [<ffffffff8115e100>] sysfs_create_dir+0x32/0xb6
...
Process scsi_scan_8 (pid: 5417, threadinfo ffff88080bd16000, task ffff880801b8a0b0)
Stack:
00000000fffffffe ffff880813470628 ffff88080bd17cd0 ffff88080614b7e8
ffff88080b45c108 00000000fffffffe ffff88080bd17d20 ffffffff8125e4a8
ffff88080bd17cf0 ffffffff81075149 ffff88080bd17d30 ffff88080614b7e8
Call Trace:
[<ffffffff8125e4a8>] kobject_add_internal+0x120/0x1e3
[<ffffffff81075149>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff8125e641>] kobject_add_varg+0x41/0x50
[<ffffffff8125e70b>] kobject_add+0x64/0x66
[<ffffffff8131122b>] device_add+0x12d/0x63a

In this scenario the parent is still valid (because we have a
reference), but it has been device_del()'d which means its kobj->sd
pointer is NULL'd via:

device_del()->kobject_del()->sysfs_remove_dir()

...and then sysfs_create_dir() (without this fix) goes ahead and
de-references parent_sd via sysfs_ns_type():

return (sd->s_flags & SYSFS_NS_TYPE_MASK) >> SYSFS_NS_TYPE_SHIFT;

This scenario is being fixed in scsi/libsas, but if other subsystems
present the same ordering the system need not immediately crash.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: James Bottomley <JBottomley@parallels.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 58f2a4c7 Thu Jul 21 18:01:12 MDT 2011 Mikulas Patocka <mpatocka@redhat.com> sysfs: remove s_sibling hacks

sysfs: remove s_sibling hacks

s_sibling was used for three different purposes:
1) as a linked list of entries in the directory
2) as a linked list of entries to be deleted
3) as a pointer to "struct completion"

This patch removes the hack and introduces new union u which
holds pointers for cases 2) and 3).

This change is needed for the following patch that removes s_sibling at all
and replaces it with a rb tree.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
diff 58f2a4c7 Thu Jul 21 18:01:12 MDT 2011 Mikulas Patocka <mpatocka@redhat.com> sysfs: remove s_sibling hacks

sysfs: remove s_sibling hacks

s_sibling was used for three different purposes:
1) as a linked list of entries in the directory
2) as a linked list of entries to be deleted
3) as a pointer to "struct completion"

This patch removes the hack and introduces new union u which
holds pointers for cases 2) and 3).

This change is needed for the following patch that removes s_sibling at all
and replaces it with a rb tree.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
diff 3ff195b0 Tue Mar 30 12:31:26 MDT 2010 Eric W. Biederman <ebiederm@xmission.com> sysfs: Implement sysfs tagged directory support.

The problem. When implementing a network namespace I need to be able
to have multiple network devices with the same name. Currently this
is a problem for /sys/class/net/*, /sys/devices/virtual/net/*, and
potentially a few other directories of the form /sys/ ... /net/*.

What this patch does is to add an additional tag field to the
sysfs dirent structure. For directories that should show different
contents depending on the context such as /sys/class/net/, and
/sys/devices/virtual/net/ this tag field is used to specify the
context in which those directories should be visible. Effectively
this is the same as creating multiple distinct directories with
the same name but internally to sysfs the result is nicer.

I am calling the concept of a single directory that looks like multiple
directories all at the same path in the filesystem tagged directories.

For the networking namespace the set of directories whose contents I need
to filter with tags can depend on the presence or absence of hotplug
hardware or which modules are currently loaded. Which means I need
a simple race free way to setup those directories as tagged.

To achieve a reace free design all tagged directories are created
and managed by sysfs itself.

Users of this interface:
- define a type in the sysfs_tag_type enumeration.
- call sysfs_register_ns_types with the type and it's operations
- sysfs_exit_ns when an individual tag is no longer valid

- Implement mount_ns() which returns the ns of the calling process
so we can attach it to a sysfs superblock.
- Implement ktype.namespace() which returns the ns of a syfs kobject.

Everything else is left up to sysfs and the driver layer.

For the network namespace mount_ns and namespace() are essentially
one line functions, and look to remain that.

Tags are currently represented a const void * pointers as that is
both generic, prevides enough information for equality comparisons,
and is trivial to create for current users, as it is just the
existing namespace pointer.

The work needed in sysfs is more extensive. At each directory
or symlink creating I need to check if the directory it is being
created in is a tagged directory and if so generate the appropriate
tag to place on the sysfs_dirent. Likewise at each symlink or
directory removal I need to check if the sysfs directory it is
being removed from is a tagged directory and if so figure out
which tag goes along with the name I am deleting.

Currently only directories which hold kobjects, and
symlinks are supported. There is not enough information
in the current file attribute interfaces to give us anything
to discriminate on which makes it useless, and there are
no potential users which makes it an uninteresting problem
to solve.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
diff 3efa65b9 Mon Aug 20 06:36:30 MDT 2007 Eric W. Biederman <ebiederm@xmission.com> sysfs: Simplify readdir.

At some point someone wrote sysfs_readdir to insert a cursor
into the list of sysfs_dirents to ensure that sysfs_readdir would
restart properly. That works but it is complex code and tends
to be expensive.

The same effect can be achieved by keeping the sysfs_dirents in
inode order and using the inode number as the f_pos. Then
when we restart we just have to find the first dirent whose inode
number is equal or greater then the last sysfs_dirent we attempted
to return.

Removing the sysfs directory cursor also allows the remove of
all of the mysterious checks for sysfs_type(sd) != 0. Which
were nonbovious checks to see if a cursor was in a directory list.

tj: offset marker for EOF is changed from UINT_MAX to INT_MAX to avoid
overflow in case offset is 32bit.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
/linux-master/include/linux/
H A Dkernfs.hdiff 79038a99 Mon Jul 24 06:18:16 MDT 2023 Arnd Bergmann <arnd@arndb.de> kernfs: add stub helper for kernfs_generic_poll()

In some randconfig builds, kernfs ends up being disabled, so there is no prototype
for kernfs_generic_poll()

In file included from kernel/sched/build_utility.c:97:
kernel/sched/psi.c:1479:3: error: implicit declaration of function 'kernfs_generic_poll' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
kernfs_generic_poll(t->of, wait);
^

Add a stub helper for it, as we have it for other kernfs functions.

Fixes: aff037078ecae ("sched/psi: use kernfs polling functions for PSI trigger polling")
Fixes: 147e1a97c4a0b ("fs: kernfs: add poll file operation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20230724121823.1357562-1-arnd@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff c2549174 Sat Aug 27 23:04:37 MDT 2022 Tejun Heo <tj@kernel.org> kernfs: Add KERNFS_REMOVING flags

KERNFS_ACTIVATED tracks whether a given node has ever been activated. As a
node was only deactivated on removal, this was used for

1. Drain optimization (removed by the previous patch).
2. To hide !activated nodes
3. To avoid double activations
4. Reject adding children to a node being removed
5. Skip activaing a node which is being removed.

We want to decouple deactivation from removal so that nodes can be
deactivated and hidden dynamically, which makes KERNFS_ACTIVATED useless for
all of the above purposes.

#1 is already gone. #2 and #3 can instead test whether the node is currently
active. A new flag KERNFS_REMOVING is added to explicitly mark nodes which
are being removed for #4 and #5.

While this leaves KERNFS_ACTIVATED with no users, leave it be as it will be
used in a following patch.

Cc: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220828050440.734579-7-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff c2549174 Sat Aug 27 23:04:37 MDT 2022 Tejun Heo <tj@kernel.org> kernfs: Add KERNFS_REMOVING flags

KERNFS_ACTIVATED tracks whether a given node has ever been activated. As a
node was only deactivated on removal, this was used for

1. Drain optimization (removed by the previous patch).
2. To hide !activated nodes
3. To avoid double activations
4. Reject adding children to a node being removed
5. Skip activaing a node which is being removed.

We want to decouple deactivation from removal so that nodes can be
deactivated and hidden dynamically, which makes KERNFS_ACTIVATED useless for
all of the above purposes.

#1 is already gone. #2 and #3 can instead test whether the node is currently
active. A new flag KERNFS_REMOVING is added to explicitly mark nodes which
are being removed for #4 and #5.

While this leaves KERNFS_ACTIVATED with no users, leave it be as it will be
used in a following patch.

Cc: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220828050440.734579-7-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff b8f35fa1 Tue Jun 14 08:10:57 MDT 2022 Imran Khan <imran.f.khan@oracle.com> kernfs: Change kernfs_notify_list to llist.

At present kernfs_notify_list is implemented as a singly linked
list of kernfs_node(s), where last element points to itself and
value of ->attr.next tells if node is present on the list or not.
Both addition and deletion to list happen under kernfs_notify_lock.

Change kernfs_notify_list to llist so that addition to list can heppen
locklessly.

Suggested by: Al Viro <viro@zeniv.linux.org.uk>

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Link: https://lore.kernel.org/r/20220615021059.862643-3-imran.f.khan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 29356624 Sun Sep 12 23:41:10 MDT 2021 Christoph Hellwig <hch@lst.de> kernfs: remove kernfs_create_file and kernfs_create_file_ns

All callers actually use __kernfs_create_file.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210913054121.616001-3-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 23bf1b6b Thu Nov 01 17:07:26 MDT 2018 David Howells <dhowells@redhat.com> kernfs, sysfs, cgroup, intel_rdt: Support fs_context

Make kernfs support superblock creation/mount/remount with fs_context.

This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.

Notes:

(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).

(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired

(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.

(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.

(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.

(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.

(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.

Weirdies:

(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.

(*) The cgroup refcount web. This really needs documenting.

(*) cgroup2 only has one root?

Add a suggestion from Thomas Gleixner in which the RDT enablement code is
placed into its own function.

[folded a leak fix from Andrey Vagin]

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: cgroups@vger.kernel.org
cc: fenghua.yu@intel.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff 8f5be0ec Mon Aug 13 00:52:09 MDT 2018 Konstantin Khlebnikov <koct9i@gmail.com> kernfs: update comment about kernfs_path() return value

Now it returns the length of the full path or error code.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 3abb1d90f5d9 ("kernfs: make kernfs_path*() behave in the style of strlcpy()")
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 3abb1d90 Wed Aug 10 09:23:44 MDT 2016 Tejun Heo <tj@kernel.org> kernfs: make kernfs_path*() behave in the style of strlcpy()

kernfs_path*() functions always return the length of the full path but
the path content is undefined if the length is larger than the
provided buffer. This makes its behavior different from strlcpy() and
requires error handling in all its users even when they don't care
about truncation. In addition, the implementation can actully be
simplified by making it behave properly in strlcpy() style.

* Update kernfs_path_from_node_locked() to always fill up the buffer
with path. If the buffer is not large enough, the output is
truncated and terminated.

* kernfs_path() no longer needs error handling. Make it a simple
inline wrapper around kernfs_path_from_node().

* sysfs_warn_dup()'s use of kernfs_path() doesn't need error handling.
Updated accordingly.

* cgroup_path()'s use of kernfs_path() updated to retain the old
behavior.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
diff e4234a1f Thu Mar 31 04:45:06 MDT 2016 Chris Wilson <chris@chris-wilson.co.uk> kernfs: Move faulting copy_user operations outside of the mutex

A fault in a user provided buffer may lead anywhere, and lockdep warns
that we have a potential deadlock between the mm->mmap_sem and the
kernfs file mutex:

[ 82.811702] ======================================================
[ 82.811705] [ INFO: possible circular locking dependency detected ]
[ 82.811709] 4.5.0-rc4-gfxbench+ #1 Not tainted
[ 82.811711] -------------------------------------------------------
[ 82.811714] kms_setmode/5859 is trying to acquire lock:
[ 82.811717] (&dev->struct_mutex){+.+.+.}, at: [<ffffffff8150d9c1>] drm_gem_mmap+0x1a1/0x270
[ 82.811731]
but task is already holding lock:
[ 82.811734] (&mm->mmap_sem){++++++}, at: [<ffffffff8117b364>] vm_mmap_pgoff+0x44/0xa0
[ 82.811745]
which lock already depends on the new lock.

[ 82.811749]
the existing dependency chain (in reverse order) is:
[ 82.811752]
-> #3 (&mm->mmap_sem){++++++}:
[ 82.811761] [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[ 82.811766] [<ffffffff8118bc65>] __might_fault+0x75/0xa0
[ 82.811771] [<ffffffff8124da4a>] kernfs_fop_write+0x8a/0x180
[ 82.811787] [<ffffffff811d1023>] __vfs_write+0x23/0xe0
[ 82.811792] [<ffffffff811d1d74>] vfs_write+0xa4/0x190
[ 82.811797] [<ffffffff811d2c14>] SyS_write+0x44/0xb0
[ 82.811801] [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
[ 82.811807]
-> #2 (s_active#6){++++.+}:
[ 82.811814] [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[ 82.811819] [<ffffffff8124c070>] __kernfs_remove+0x210/0x2f0
[ 82.811823] [<ffffffff8124d040>] kernfs_remove_by_name_ns+0x40/0xa0
[ 82.811828] [<ffffffff8124e9e0>] sysfs_remove_file_ns+0x10/0x20
[ 82.811832] [<ffffffff815318d4>] device_del+0x124/0x250
[ 82.811837] [<ffffffff81531a19>] device_unregister+0x19/0x60
[ 82.811841] [<ffffffff8153c051>] cpu_cache_sysfs_exit+0x51/0xb0
[ 82.811846] [<ffffffff8153c628>] cacheinfo_cpu_callback+0x38/0x70
[ 82.811851] [<ffffffff8109ae89>] notifier_call_chain+0x39/0xa0
[ 82.811856] [<ffffffff8109aef9>] __raw_notifier_call_chain+0x9/0x10
[ 82.811860] [<ffffffff810786de>] cpu_notify+0x1e/0x40
[ 82.811865] [<ffffffff81078779>] cpu_notify_nofail+0x9/0x20
[ 82.811869] [<ffffffff81078ac3>] _cpu_down+0x233/0x340
[ 82.811874] [<ffffffff81079019>] disable_nonboot_cpus+0xc9/0x350
[ 82.811878] [<ffffffff810d2e11>] suspend_devices_and_enter+0x5a1/0xb50
[ 82.811883] [<ffffffff810d3903>] pm_suspend+0x543/0x8d0
[ 82.811888] [<ffffffff810d1b77>] state_store+0x77/0xe0
[ 82.811892] [<ffffffff813fa68f>] kobj_attr_store+0xf/0x20
[ 82.811897] [<ffffffff8124e740>] sysfs_kf_write+0x40/0x50
[ 82.811902] [<ffffffff8124dafc>] kernfs_fop_write+0x13c/0x180
[ 82.811906] [<ffffffff811d1023>] __vfs_write+0x23/0xe0
[ 82.811910] [<ffffffff811d1d74>] vfs_write+0xa4/0x190
[ 82.811914] [<ffffffff811d2c14>] SyS_write+0x44/0xb0
[ 82.811918] [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
[ 82.811923]
-> #1 (cpu_hotplug.lock){+.+.+.}:
[ 82.811929] [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[ 82.811933] [<ffffffff817b6f72>] mutex_lock_nested+0x62/0x3b0
[ 82.811940] [<ffffffff810784c1>] get_online_cpus+0x61/0x80
[ 82.811944] [<ffffffff811170eb>] stop_machine+0x1b/0xe0
[ 82.811949] [<ffffffffa0178edd>] gen8_ggtt_insert_entries__BKL+0x2d/0x30 [i915]
[ 82.812009] [<ffffffffa017d3a6>] ggtt_bind_vma+0x46/0x70 [i915]
[ 82.812045] [<ffffffffa017eb70>] i915_vma_bind+0x140/0x290 [i915]
[ 82.812081] [<ffffffffa01862b9>] i915_gem_object_do_pin+0x899/0xb00 [i915]
[ 82.812117] [<ffffffffa0186555>] i915_gem_object_pin+0x35/0x40 [i915]
[ 82.812154] [<ffffffffa019a23e>] intel_init_pipe_control+0xbe/0x210 [i915]
[ 82.812192] [<ffffffffa0197312>] intel_logical_rings_init+0xe2/0xde0 [i915]
[ 82.812232] [<ffffffffa0186fe3>] i915_gem_init+0xf3/0x130 [i915]
[ 82.812278] [<ffffffffa02097ed>] i915_driver_load+0xf2d/0x1770 [i915]
[ 82.812318] [<ffffffff81512474>] drm_dev_register+0xa4/0xb0
[ 82.812323] [<ffffffff8151467e>] drm_get_pci_dev+0xce/0x1e0
[ 82.812328] [<ffffffffa01472cf>] i915_pci_probe+0x2f/0x50 [i915]
[ 82.812360] [<ffffffff8143f907>] pci_device_probe+0x87/0xf0
[ 82.812366] [<ffffffff81535f89>] driver_probe_device+0x229/0x450
[ 82.812371] [<ffffffff81536233>] __driver_attach+0x83/0x90
[ 82.812375] [<ffffffff81533c61>] bus_for_each_dev+0x61/0xa0
[ 82.812380] [<ffffffff81535879>] driver_attach+0x19/0x20
[ 82.812384] [<ffffffff8153535f>] bus_add_driver+0x1ef/0x290
[ 82.812388] [<ffffffff81536e9b>] driver_register+0x5b/0xe0
[ 82.812393] [<ffffffff8143e83b>] __pci_register_driver+0x5b/0x60
[ 82.812398] [<ffffffff81514866>] drm_pci_init+0xd6/0x100
[ 82.812402] [<ffffffffa027c094>] 0xffffffffa027c094
[ 82.812406] [<ffffffff810003de>] do_one_initcall+0xae/0x1d0
[ 82.812412] [<ffffffff811595a0>] do_init_module+0x5b/0x1cb
[ 82.812417] [<ffffffff81106160>] load_module+0x1c20/0x2480
[ 82.812422] [<ffffffff81106bae>] SyS_finit_module+0x7e/0xa0
[ 82.812428] [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
[ 82.812433]
-> #0 (&dev->struct_mutex){+.+.+.}:
[ 82.812439] [<ffffffff810cbe59>] __lock_acquire+0x1fc9/0x20f0
[ 82.812443] [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[ 82.812456] [<ffffffff8150d9e7>] drm_gem_mmap+0x1c7/0x270
[ 82.812460] [<ffffffff81196a14>] mmap_region+0x334/0x580
[ 82.812466] [<ffffffff81196fc4>] do_mmap+0x364/0x410
[ 82.812470] [<ffffffff8117b38d>] vm_mmap_pgoff+0x6d/0xa0
[ 82.812474] [<ffffffff811950f4>] SyS_mmap_pgoff+0x184/0x220
[ 82.812479] [<ffffffff8100a0fd>] SyS_mmap+0x1d/0x20
[ 82.812484] [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
[ 82.812489]
other info that might help us debug this:

[ 82.812493] Chain exists of:
&dev->struct_mutex --> s_active#6 --> &mm->mmap_sem

[ 82.812502] Possible unsafe locking scenario:

[ 82.812506] CPU0 CPU1
[ 82.812508] ---- ----
[ 82.812510] lock(&mm->mmap_sem);
[ 82.812514] lock(s_active#6);
[ 82.812519] lock(&mm->mmap_sem);
[ 82.812522] lock(&dev->struct_mutex);
[ 82.812526]
*** DEADLOCK ***

[ 82.812531] 1 lock held by kms_setmode/5859:
[ 82.812533] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff8117b364>] vm_mmap_pgoff+0x44/0xa0
[ 82.812541]
stack backtrace:
[ 82.812547] CPU: 0 PID: 5859 Comm: kms_setmode Not tainted 4.5.0-rc4-gfxbench+ #1
[ 82.812550] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0040.2015.0814.1353 08/14/2015
[ 82.812553] 0000000000000000 ffff880079407bf0 ffffffff813f8505 ffffffff825fb270
[ 82.812560] ffffffff825c4190 ffff880079407c30 ffffffff810c84ac ffff880079407c90
[ 82.812566] ffff8800797ed328 ffff8800797ecb00 0000000000000001 ffff8800797ed350
[ 82.812573] Call Trace:
[ 82.812578] [<ffffffff813f8505>] dump_stack+0x67/0x92
[ 82.812582] [<ffffffff810c84ac>] print_circular_bug+0x1fc/0x310
[ 82.812586] [<ffffffff810cbe59>] __lock_acquire+0x1fc9/0x20f0
[ 82.812590] [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[ 82.812594] [<ffffffff8150d9c1>] ? drm_gem_mmap+0x1a1/0x270
[ 82.812599] [<ffffffff8150d9e7>] drm_gem_mmap+0x1c7/0x270
[ 82.812603] [<ffffffff8150d9c1>] ? drm_gem_mmap+0x1a1/0x270
[ 82.812608] [<ffffffff81196a14>] mmap_region+0x334/0x580
[ 82.812612] [<ffffffff81196fc4>] do_mmap+0x364/0x410
[ 82.812616] [<ffffffff8117b38d>] vm_mmap_pgoff+0x6d/0xa0
[ 82.812629] [<ffffffff811950f4>] SyS_mmap_pgoff+0x184/0x220
[ 82.812633] [<ffffffff8100a0fd>] SyS_mmap+0x1d/0x20
[ 82.812637] [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73

Highly unlikely though this scenario is, we can avoid the issue entirely
by moving the copy operation from out under the kernfs_get_active()
tracking by assigning the preallocated buffer its own mutex. The
temporary buffer allocation doesn't require mutex locking as it is
entirely local.

The locked section was extended by the addition of the preallocated buf
to speed up md user operations in

commit 2b75869bba676c248d8d25ae6d2bd9221dfffdb6
Author: NeilBrown <neilb@suse.de>
Date: Mon Oct 13 16:41:28 2014 +1100

sysfs/kernfs: allow attributes to request write buffer be pre-allocated.

Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: NeilBrown <neilb@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff fed95bab Tue Feb 25 04:28:44 MST 2014 Li Zefan <lizefan@huawei.com> sysfs: fix namespace refcnt leak

As mount() and kill_sb() is not a one-to-one match, we shoudn't get
ns refcnt unconditionally in sysfs_mount(), and instead we should
get the refcnt only when kernfs_mount() allocated a new superblock.

v2:
- Changed the name of the new argument, suggested by Tejun.
- Made the argument optional, suggested by Tejun.

v3:
- Make the new argument as second-to-last arg, suggested by Tejun.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
fs/kernfs/mount.c | 8 +++++++-
fs/sysfs/mount.c | 5 +++--
include/linux/kernfs.h | 9 +++++----
3 files changed, 15 insertions(+), 7 deletions(-)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
/linux-master/fs/kernfs/
H A Ddir.cdiff e3977e06 Tue Jan 09 14:48:02 MST 2024 Tejun Heo <tj@kernel.org> Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock"

This reverts commit dad3fb67ca1cbef87ce700e83a55835e5921ce8a.

The commit converted kernfs_idr_lock to an IRQ-safe raw_spinlock because it
could be acquired while holding an rq lock through bpf_cgroup_from_id().
However, kernfs_idr_lock is held while doing GPF_NOWAIT allocations which
involves acquiring an non-IRQ-safe and non-raw lock leading to the following
lockdep warning:

=============================
[ BUG: Invalid wait context ]
6.7.0-rc5-kzm9g-00251-g655022a45b1c #578 Not tainted
-----------------------------
swapper/0/0 is trying to lock:
dfbcd488 (&c->lock){....}-{3:3}, at: local_lock_acquire+0x0/0xa4
other info that might help us debug this:
context-{5:5}
2 locks held by swapper/0/0:
#0: dfbc9c60 (lock){+.+.}-{3:3}, at: local_lock_acquire+0x0/0xa4
#1: c0c012a8 (kernfs_idr_lock){....}-{2:2}, at: __kernfs_new_node.constprop.0+0x68/0x258
stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc5-kzm9g-00251-g655022a45b1c #578
Hardware name: Generic SH73A0 (Flattened Device Tree)
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x68/0x90
dump_stack_lvl from __lock_acquire+0x3cc/0x168c
__lock_acquire from lock_acquire+0x274/0x30c
lock_acquire from local_lock_acquire+0x28/0xa4
local_lock_acquire from ___slab_alloc+0x234/0x8a8
___slab_alloc from __slab_alloc.constprop.0+0x30/0x44
__slab_alloc.constprop.0 from kmem_cache_alloc+0x7c/0x148
kmem_cache_alloc from radix_tree_node_alloc.constprop.0+0x44/0xdc
radix_tree_node_alloc.constprop.0 from idr_get_free+0x110/0x2b8
idr_get_free from idr_alloc_u32+0x9c/0x108
idr_alloc_u32 from idr_alloc_cyclic+0x50/0xb8
idr_alloc_cyclic from __kernfs_new_node.constprop.0+0x88/0x258
__kernfs_new_node.constprop.0 from kernfs_create_root+0xbc/0x154
kernfs_create_root from sysfs_init+0x18/0x5c
sysfs_init from mnt_init+0xc4/0x220
mnt_init from vfs_caches_init+0x6c/0x88
vfs_caches_init from start_kernel+0x474/0x528
start_kernel from 0x0

Let's rever the commit. It's undesirable to spread out raw spinlock usage
anyway and the problem can be solved by protecting the lookup path with RCU
instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <andrea.righi@canonical.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: http://lkml.kernel.org/r/CAMuHMdV=AKt+mwY7svEq5gFPx41LoSQZ_USME5_MEdWQze13ww@mail.gmail.com
Link: https://lore.kernel.org/r/20240109214828.252092-2-tj@kernel.org
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff e3977e06 Tue Jan 09 14:48:02 MST 2024 Tejun Heo <tj@kernel.org> Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock"

This reverts commit dad3fb67ca1cbef87ce700e83a55835e5921ce8a.

The commit converted kernfs_idr_lock to an IRQ-safe raw_spinlock because it
could be acquired while holding an rq lock through bpf_cgroup_from_id().
However, kernfs_idr_lock is held while doing GPF_NOWAIT allocations which
involves acquiring an non-IRQ-safe and non-raw lock leading to the following
lockdep warning:

=============================
[ BUG: Invalid wait context ]
6.7.0-rc5-kzm9g-00251-g655022a45b1c #578 Not tainted
-----------------------------
swapper/0/0 is trying to lock:
dfbcd488 (&c->lock){....}-{3:3}, at: local_lock_acquire+0x0/0xa4
other info that might help us debug this:
context-{5:5}
2 locks held by swapper/0/0:
#0: dfbc9c60 (lock){+.+.}-{3:3}, at: local_lock_acquire+0x0/0xa4
#1: c0c012a8 (kernfs_idr_lock){....}-{2:2}, at: __kernfs_new_node.constprop.0+0x68/0x258
stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc5-kzm9g-00251-g655022a45b1c #578
Hardware name: Generic SH73A0 (Flattened Device Tree)
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x68/0x90
dump_stack_lvl from __lock_acquire+0x3cc/0x168c
__lock_acquire from lock_acquire+0x274/0x30c
lock_acquire from local_lock_acquire+0x28/0xa4
local_lock_acquire from ___slab_alloc+0x234/0x8a8
___slab_alloc from __slab_alloc.constprop.0+0x30/0x44
__slab_alloc.constprop.0 from kmem_cache_alloc+0x7c/0x148
kmem_cache_alloc from radix_tree_node_alloc.constprop.0+0x44/0xdc
radix_tree_node_alloc.constprop.0 from idr_get_free+0x110/0x2b8
idr_get_free from idr_alloc_u32+0x9c/0x108
idr_alloc_u32 from idr_alloc_cyclic+0x50/0xb8
idr_alloc_cyclic from __kernfs_new_node.constprop.0+0x88/0x258
__kernfs_new_node.constprop.0 from kernfs_create_root+0xbc/0x154
kernfs_create_root from sysfs_init+0x18/0x5c
sysfs_init from mnt_init+0xc4/0x220
mnt_init from vfs_caches_init+0x6c/0x88
vfs_caches_init from start_kernel+0x474/0x528
start_kernel from 0x0

Let's rever the commit. It's undesirable to spread out raw spinlock usage
anyway and the problem can be solved by protecting the lookup path with RCU
instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <andrea.righi@canonical.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: http://lkml.kernel.org/r/CAMuHMdV=AKt+mwY7svEq5gFPx41LoSQZ_USME5_MEdWQze13ww@mail.gmail.com
Link: https://lore.kernel.org/r/20240109214828.252092-2-tj@kernel.org
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff e3977e06 Tue Jan 09 14:48:02 MST 2024 Tejun Heo <tj@kernel.org> Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock"

This reverts commit dad3fb67ca1cbef87ce700e83a55835e5921ce8a.

The commit converted kernfs_idr_lock to an IRQ-safe raw_spinlock because it
could be acquired while holding an rq lock through bpf_cgroup_from_id().
However, kernfs_idr_lock is held while doing GPF_NOWAIT allocations which
involves acquiring an non-IRQ-safe and non-raw lock leading to the following
lockdep warning:

=============================
[ BUG: Invalid wait context ]
6.7.0-rc5-kzm9g-00251-g655022a45b1c #578 Not tainted
-----------------------------
swapper/0/0 is trying to lock:
dfbcd488 (&c->lock){....}-{3:3}, at: local_lock_acquire+0x0/0xa4
other info that might help us debug this:
context-{5:5}
2 locks held by swapper/0/0:
#0: dfbc9c60 (lock){+.+.}-{3:3}, at: local_lock_acquire+0x0/0xa4
#1: c0c012a8 (kernfs_idr_lock){....}-{2:2}, at: __kernfs_new_node.constprop.0+0x68/0x258
stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc5-kzm9g-00251-g655022a45b1c #578
Hardware name: Generic SH73A0 (Flattened Device Tree)
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x68/0x90
dump_stack_lvl from __lock_acquire+0x3cc/0x168c
__lock_acquire from lock_acquire+0x274/0x30c
lock_acquire from local_lock_acquire+0x28/0xa4
local_lock_acquire from ___slab_alloc+0x234/0x8a8
___slab_alloc from __slab_alloc.constprop.0+0x30/0x44
__slab_alloc.constprop.0 from kmem_cache_alloc+0x7c/0x148
kmem_cache_alloc from radix_tree_node_alloc.constprop.0+0x44/0xdc
radix_tree_node_alloc.constprop.0 from idr_get_free+0x110/0x2b8
idr_get_free from idr_alloc_u32+0x9c/0x108
idr_alloc_u32 from idr_alloc_cyclic+0x50/0xb8
idr_alloc_cyclic from __kernfs_new_node.constprop.0+0x88/0x258
__kernfs_new_node.constprop.0 from kernfs_create_root+0xbc/0x154
kernfs_create_root from sysfs_init+0x18/0x5c
sysfs_init from mnt_init+0xc4/0x220
mnt_init from vfs_caches_init+0x6c/0x88
vfs_caches_init from start_kernel+0x474/0x528
start_kernel from 0x0

Let's rever the commit. It's undesirable to spread out raw spinlock usage
anyway and the problem can be solved by protecting the lookup path with RCU
instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <andrea.righi@canonical.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: http://lkml.kernel.org/r/CAMuHMdV=AKt+mwY7svEq5gFPx41LoSQZ_USME5_MEdWQze13ww@mail.gmail.com
Link: https://lore.kernel.org/r/20240109214828.252092-2-tj@kernel.org
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff e3977e06 Tue Jan 09 14:48:02 MST 2024 Tejun Heo <tj@kernel.org> Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock"

This reverts commit dad3fb67ca1cbef87ce700e83a55835e5921ce8a.

The commit converted kernfs_idr_lock to an IRQ-safe raw_spinlock because it
could be acquired while holding an rq lock through bpf_cgroup_from_id().
However, kernfs_idr_lock is held while doing GPF_NOWAIT allocations which
involves acquiring an non-IRQ-safe and non-raw lock leading to the following
lockdep warning:

=============================
[ BUG: Invalid wait context ]
6.7.0-rc5-kzm9g-00251-g655022a45b1c #578 Not tainted
-----------------------------
swapper/0/0 is trying to lock:
dfbcd488 (&c->lock){....}-{3:3}, at: local_lock_acquire+0x0/0xa4
other info that might help us debug this:
context-{5:5}
2 locks held by swapper/0/0:
#0: dfbc9c60 (lock){+.+.}-{3:3}, at: local_lock_acquire+0x0/0xa4
#1: c0c012a8 (kernfs_idr_lock){....}-{2:2}, at: __kernfs_new_node.constprop.0+0x68/0x258
stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc5-kzm9g-00251-g655022a45b1c #578
Hardware name: Generic SH73A0 (Flattened Device Tree)
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x68/0x90
dump_stack_lvl from __lock_acquire+0x3cc/0x168c
__lock_acquire from lock_acquire+0x274/0x30c
lock_acquire from local_lock_acquire+0x28/0xa4
local_lock_acquire from ___slab_alloc+0x234/0x8a8
___slab_alloc from __slab_alloc.constprop.0+0x30/0x44
__slab_alloc.constprop.0 from kmem_cache_alloc+0x7c/0x148
kmem_cache_alloc from radix_tree_node_alloc.constprop.0+0x44/0xdc
radix_tree_node_alloc.constprop.0 from idr_get_free+0x110/0x2b8
idr_get_free from idr_alloc_u32+0x9c/0x108
idr_alloc_u32 from idr_alloc_cyclic+0x50/0xb8
idr_alloc_cyclic from __kernfs_new_node.constprop.0+0x88/0x258
__kernfs_new_node.constprop.0 from kernfs_create_root+0xbc/0x154
kernfs_create_root from sysfs_init+0x18/0x5c
sysfs_init from mnt_init+0xc4/0x220
mnt_init from vfs_caches_init+0x6c/0x88
vfs_caches_init from start_kernel+0x474/0x528
start_kernel from 0x0

Let's rever the commit. It's undesirable to spread out raw spinlock usage
anyway and the problem can be solved by protecting the lookup path with RCU
instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <andrea.righi@canonical.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: http://lkml.kernel.org/r/CAMuHMdV=AKt+mwY7svEq5gFPx41LoSQZ_USME5_MEdWQze13ww@mail.gmail.com
Link: https://lore.kernel.org/r/20240109214828.252092-2-tj@kernel.org
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff ff6d413b Tue Dec 12 14:17:40 MST 2023 Kees Cook <keescook@chromium.org> kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy()

One of the last remaining users of strlcpy() in the kernel is
kernfs_path_from_node_locked(), which passes back the problematic "length
we _would_ have copied" return value to indicate truncation. Convert the
chain of all callers to use the negative return value (some of which
already doing this explicitly). All callers were already also checking
for negative return values, so the risk to missed checks looks very low.

In this analysis, it was found that cgroup1_release_agent() actually
didn't handle the "too large" condition, so this is technically also a
bug fix. :)

Here's the chain of callers, and resolution identifying each one as now
handling the correct return value:

kernfs_path_from_node_locked()
kernfs_path_from_node()
pr_cont_kernfs_path()
returns void
kernfs_path()
sysfs_warn_dup()
return value ignored
cgroup_path()
blkg_path()
bfq_bic_update_cgroup()
return value ignored
TRACE_IOCG_PATH()
return value ignored
TRACE_CGROUP_PATH()
return value ignored
perf_event_cgroup()
return value ignored
task_group_path()
return value ignored
damon_sysfs_memcg_path_eq()
return value ignored
get_mm_memcg_path()
return value ignored
lru_gen_seq_show()
return value ignored
cgroup_path_from_kernfs_id()
return value ignored
cgroup_show_path()
already converted "too large" error to negative value
cgroup_path_ns_locked()
cgroup_path_ns()
bpf_iter_cgroup_show_fdinfo()
return value ignored
cgroup1_release_agent()
wasn't checking "too large" error
proc_cgroup_show()
already converted "too large" to negative value

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: <cgroups@vger.kernel.org>
Co-developed-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Link: https://lore.kernel.org/r/20231116192127.1558276-3-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231212211741.164376-3-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff ff6d413b Tue Dec 12 14:17:40 MST 2023 Kees Cook <keescook@chromium.org> kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy()

One of the last remaining users of strlcpy() in the kernel is
kernfs_path_from_node_locked(), which passes back the problematic "length
we _would_ have copied" return value to indicate truncation. Convert the
chain of all callers to use the negative return value (some of which
already doing this explicitly). All callers were already also checking
for negative return values, so the risk to missed checks looks very low.

In this analysis, it was found that cgroup1_release_agent() actually
didn't handle the "too large" condition, so this is technically also a
bug fix. :)

Here's the chain of callers, and resolution identifying each one as now
handling the correct return value:

kernfs_path_from_node_locked()
kernfs_path_from_node()
pr_cont_kernfs_path()
returns void
kernfs_path()
sysfs_warn_dup()
return value ignored
cgroup_path()
blkg_path()
bfq_bic_update_cgroup()
return value ignored
TRACE_IOCG_PATH()
return value ignored
TRACE_CGROUP_PATH()
return value ignored
perf_event_cgroup()
return value ignored
task_group_path()
return value ignored
damon_sysfs_memcg_path_eq()
return value ignored
get_mm_memcg_path()
return value ignored
lru_gen_seq_show()
return value ignored
cgroup_path_from_kernfs_id()
return value ignored
cgroup_show_path()
already converted "too large" error to negative value
cgroup_path_ns_locked()
cgroup_path_ns()
bpf_iter_cgroup_show_fdinfo()
return value ignored
cgroup1_release_agent()
wasn't checking "too large" error
proc_cgroup_show()
already converted "too large" to negative value

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: <cgroups@vger.kernel.org>
Co-developed-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Link: https://lore.kernel.org/r/20231116192127.1558276-3-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231212211741.164376-3-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff c9f2dfb7 Thu Mar 09 04:09:31 MST 2023 Imran Khan <imran.f.khan@oracle.com> kernfs: Use a per-fs rwsem to protect per-fs list of kernfs_super_info.

Right now per-fs kernfs_rwsem protects list of kernfs_super_info instances
for a kernfs_root. Since kernfs_rwsem is used to synchronize several other
operations across kernfs and since most of these operations don't impact
kernfs_super_info, we can use a separate per-fs rwsem to synchronize access
to list of kernfs_super_info.
This helps in reducing contention around kernfs_rwsem and also allows
operations that change/access list of kernfs_super_info to proceed without
contending for kernfs_rwsem.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Link: https://lore.kernel.org/r/20230309110932.2889010-3-imran.f.khan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 4abc9965 Tue Sep 13 06:17:23 MDT 2022 Christian A. Ehrhardt <lk@c--e.de> kernfs: fix use-after-free in __kernfs_remove

Syzkaller managed to trigger concurrent calls to
kernfs_remove_by_name_ns() for the same file resulting in
a KASAN detected use-after-free. The race occurs when the root
node is freed during kernfs_drain().

To prevent this acquire an additional reference for the root
of the tree that is removed before calling __kernfs_remove().

Found by syzkaller with the following reproducer (slab_nomerge is
required):

syz_mount_image$ext4(0x0, &(0x7f0000000100)='./file0\x00', 0x100000, 0x0, 0x0, 0x0, 0x0)
r0 = openat(0xffffffffffffff9c, &(0x7f0000000080)='/proc/self/exe\x00', 0x0, 0x0)
close(r0)
pipe2(&(0x7f0000000140)={0xffffffffffffffff, <r1=>0xffffffffffffffff}, 0x800)
mount$9p_fd(0x0, &(0x7f0000000040)='./file0\x00', &(0x7f00000000c0), 0x408, &(0x7f0000000280)={'trans=fd,', {'rfdno', 0x3d, r0}, 0x2c, {'wfdno', 0x3d, r1}, 0x2c, {[{@cache_loose}, {@mmap}, {@loose}, {@loose}, {@mmap}], [{@mask={'mask', 0x3d, '^MAY_EXEC'}}, {@fsmagic={'fsmagic', 0x3d, 0x10001}}, {@dont_hash}]}})

Sample report:

==================================================================
BUG: KASAN: use-after-free in kernfs_type include/linux/kernfs.h:335 [inline]
BUG: KASAN: use-after-free in kernfs_leftmost_descendant fs/kernfs/dir.c:1261 [inline]
BUG: KASAN: use-after-free in __kernfs_remove.part.0+0x843/0x960 fs/kernfs/dir.c:1369
Read of size 2 at addr ffff8880088807f0 by task syz-executor.2/857

CPU: 0 PID: 857 Comm: syz-executor.2 Not tainted 6.0.0-rc3-00363-g7726d4c3e60b #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x6e/0x91 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:317 [inline]
print_report.cold+0x5e/0x5e5 mm/kasan/report.c:433
kasan_report+0xa3/0x130 mm/kasan/report.c:495
kernfs_type include/linux/kernfs.h:335 [inline]
kernfs_leftmost_descendant fs/kernfs/dir.c:1261 [inline]
__kernfs_remove.part.0+0x843/0x960 fs/kernfs/dir.c:1369
__kernfs_remove fs/kernfs/dir.c:1356 [inline]
kernfs_remove_by_name_ns+0x108/0x190 fs/kernfs/dir.c:1589
sysfs_slab_add+0x133/0x1e0 mm/slub.c:5943
__kmem_cache_create+0x3e0/0x550 mm/slub.c:4899
create_cache mm/slab_common.c:229 [inline]
kmem_cache_create_usercopy+0x167/0x2a0 mm/slab_common.c:335
p9_client_create+0xd4d/0x1190 net/9p/client.c:993
v9fs_session_init+0x1e6/0x13c0 fs/9p/v9fs.c:408
v9fs_mount+0xb9/0xbd0 fs/9p/vfs_super.c:126
legacy_get_tree+0xf1/0x200 fs/fs_context.c:610
vfs_get_tree+0x85/0x2e0 fs/super.c:1530
do_new_mount fs/namespace.c:3040 [inline]
path_mount+0x675/0x1d00 fs/namespace.c:3370
do_mount fs/namespace.c:3383 [inline]
__do_sys_mount fs/namespace.c:3591 [inline]
__se_sys_mount fs/namespace.c:3568 [inline]
__x64_sys_mount+0x282/0x300 fs/namespace.c:3568
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f725f983aed
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f725f0f7028 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007f725faa3f80 RCX: 00007f725f983aed
RDX: 00000000200000c0 RSI: 0000000020000040 RDI: 0000000000000000
RBP: 00007f725f9f419c R08: 0000000020000280 R09: 0000000000000000
R10: 0000000000000408 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000006 R14: 00007f725faa3f80 R15: 00007f725f0d7000
</TASK>

Allocated by task 855:
kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:45 [inline]
set_alloc_info mm/kasan/common.c:437 [inline]
__kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:470
kasan_slab_alloc include/linux/kasan.h:224 [inline]
slab_post_alloc_hook mm/slab.h:727 [inline]
slab_alloc_node mm/slub.c:3243 [inline]
slab_alloc mm/slub.c:3251 [inline]
__kmem_cache_alloc_lru mm/slub.c:3258 [inline]
kmem_cache_alloc+0xbf/0x200 mm/slub.c:3268
kmem_cache_zalloc include/linux/slab.h:723 [inline]
__kernfs_new_node+0xd4/0x680 fs/kernfs/dir.c:593
kernfs_new_node fs/kernfs/dir.c:655 [inline]
kernfs_create_dir_ns+0x9c/0x220 fs/kernfs/dir.c:1010
sysfs_create_dir_ns+0x127/0x290 fs/sysfs/dir.c:59
create_dir lib/kobject.c:63 [inline]
kobject_add_internal+0x24a/0x8d0 lib/kobject.c:223
kobject_add_varg lib/kobject.c:358 [inline]
kobject_init_and_add+0x101/0x160 lib/kobject.c:441
sysfs_slab_add+0x156/0x1e0 mm/slub.c:5954
__kmem_cache_create+0x3e0/0x550 mm/slub.c:4899
create_cache mm/slab_common.c:229 [inline]
kmem_cache_create_usercopy+0x167/0x2a0 mm/slab_common.c:335
p9_client_create+0xd4d/0x1190 net/9p/client.c:993
v9fs_session_init+0x1e6/0x13c0 fs/9p/v9fs.c:408
v9fs_mount+0xb9/0xbd0 fs/9p/vfs_super.c:126
legacy_get_tree+0xf1/0x200 fs/fs_context.c:610
vfs_get_tree+0x85/0x2e0 fs/super.c:1530
do_new_mount fs/namespace.c:3040 [inline]
path_mount+0x675/0x1d00 fs/namespace.c:3370
do_mount fs/namespace.c:3383 [inline]
__do_sys_mount fs/namespace.c:3591 [inline]
__se_sys_mount fs/namespace.c:3568 [inline]
__x64_sys_mount+0x282/0x300 fs/namespace.c:3568
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Freed by task 857:
kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
kasan_set_track+0x21/0x30 mm/kasan/common.c:45
kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:370
____kasan_slab_free mm/kasan/common.c:367 [inline]
____kasan_slab_free mm/kasan/common.c:329 [inline]
__kasan_slab_free+0x108/0x190 mm/kasan/common.c:375
kasan_slab_free include/linux/kasan.h:200 [inline]
slab_free_hook mm/slub.c:1754 [inline]
slab_free_freelist_hook mm/slub.c:1780 [inline]
slab_free mm/slub.c:3534 [inline]
kmem_cache_free+0x9c/0x340 mm/slub.c:3551
kernfs_put.part.0+0x2b2/0x520 fs/kernfs/dir.c:547
kernfs_put+0x42/0x50 fs/kernfs/dir.c:521
__kernfs_remove.part.0+0x72d/0x960 fs/kernfs/dir.c:1407
__kernfs_remove fs/kernfs/dir.c:1356 [inline]
kernfs_remove_by_name_ns+0x108/0x190 fs/kernfs/dir.c:1589
sysfs_slab_add+0x133/0x1e0 mm/slub.c:5943
__kmem_cache_create+0x3e0/0x550 mm/slub.c:4899
create_cache mm/slab_common.c:229 [inline]
kmem_cache_create_usercopy+0x167/0x2a0 mm/slab_common.c:335
p9_client_create+0xd4d/0x1190 net/9p/client.c:993
v9fs_session_init+0x1e6/0x13c0 fs/9p/v9fs.c:408
v9fs_mount+0xb9/0xbd0 fs/9p/vfs_super.c:126
legacy_get_tree+0xf1/0x200 fs/fs_context.c:610
vfs_get_tree+0x85/0x2e0 fs/super.c:1530
do_new_mount fs/namespace.c:3040 [inline]
path_mount+0x675/0x1d00 fs/namespace.c:3370
do_mount fs/namespace.c:3383 [inline]
__do_sys_mount fs/namespace.c:3591 [inline]
__se_sys_mount fs/namespace.c:3568 [inline]
__x64_sys_mount+0x282/0x300 fs/namespace.c:3568
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

The buggy address belongs to the object at ffff888008880780
which belongs to the cache kernfs_node_cache of size 128
The buggy address is located 112 bytes inside of
128-byte region [ffff888008880780, ffff888008880800)

The buggy address belongs to the physical page:
page:00000000732833f8 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8880
flags: 0x100000000000200(slab|node=0|zone=1)
raw: 0100000000000200 0000000000000000 dead000000000122 ffff888001147280
raw: 0000000000000000 0000000000150015 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff888008880680: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
ffff888008880700: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff888008880780: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888008880800: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
ffff888008880880: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
==================================================================

Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable <stable@kernel.org> # -rc3
Signed-off-by: Christian A. Ehrhardt <lk@c--e.de>
Link: https://lore.kernel.org/r/20220913121723.691454-1-lk@c--e.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff c2549174 Sat Aug 27 23:04:37 MDT 2022 Tejun Heo <tj@kernel.org> kernfs: Add KERNFS_REMOVING flags

KERNFS_ACTIVATED tracks whether a given node has ever been activated. As a
node was only deactivated on removal, this was used for

1. Drain optimization (removed by the previous patch).
2. To hide !activated nodes
3. To avoid double activations
4. Reject adding children to a node being removed
5. Skip activaing a node which is being removed.

We want to decouple deactivation from removal so that nodes can be
deactivated and hidden dynamically, which makes KERNFS_ACTIVATED useless for
all of the above purposes.

#1 is already gone. #2 and #3 can instead test whether the node is currently
active. A new flag KERNFS_REMOVING is added to explicitly mark nodes which
are being removed for #4 and #5.

While this leaves KERNFS_ACTIVATED with no users, leave it be as it will be
used in a following patch.

Cc: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220828050440.734579-7-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff c2549174 Sat Aug 27 23:04:37 MDT 2022 Tejun Heo <tj@kernel.org> kernfs: Add KERNFS_REMOVING flags

KERNFS_ACTIVATED tracks whether a given node has ever been activated. As a
node was only deactivated on removal, this was used for

1. Drain optimization (removed by the previous patch).
2. To hide !activated nodes
3. To avoid double activations
4. Reject adding children to a node being removed
5. Skip activaing a node which is being removed.

We want to decouple deactivation from removal so that nodes can be
deactivated and hidden dynamically, which makes KERNFS_ACTIVATED useless for
all of the above purposes.

#1 is already gone. #2 and #3 can instead test whether the node is currently
active. A new flag KERNFS_REMOVING is added to explicitly mark nodes which
are being removed for #4 and #5.

While this leaves KERNFS_ACTIVATED with no users, leave it be as it will be
used in a following patch.

Cc: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220828050440.734579-7-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Completed in 277 milliseconds