History log of /linux-master/fs/tracefs/inode.c
Revision Date Author Comments
# f88c3fb8 12-Mar-2024 Linus Torvalds <torvalds@linux-foundation.org>

mm, slab: remove last vestiges of SLAB_MEM_SPREAD

Yes, yes, I know the slab people were planning on going slow and letting
every subsystem fight this thing on their own. But let's just rip off
the band-aid and get it over and done with. I don't want to see a
number of unnecessary pull requests just to get rid of a flag that no
longer has any meaning.

This was mainly done with a couple of 'sed' scripts and then some manual
cleanup of the end result.

Link: https://lore.kernel.org/all/CAHk-=wji0u+OOtmAOD-5JV3SXcRJF___k_+8XNKmak0yd5vW1Q@mail.gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8dce06e98 31-Jan-2024 Linus Torvalds <torvalds@linux-foundation.org>

eventfs: Clean up dentry ops and add revalidate function

In order for the dentries to stay up-to-date with the eventfs changes,
just add a 'd_revalidate' function that checks the 'is_freed' bit.

Also, clean up the dentry release to actually use d_release() rather
than the slightly odd d_iput() function. We don't care about the inode,
all we want to do is to get rid of the refcount to the eventfs data
added by dentry->d_fsdata.

It would probably be cleaner to make eventfs its own filesystem, or at
least set its own dentry ops when looking up eventfs files. But as it
is, only eventfs dentries use d_fsdata, so we don't really need to split
these things up by use.

Another thing that might be worth doing is to make all eventfs lookups
mark their dentries as not worth caching. We could do that with
d_delete(), but the DCACHE_DONTCACHE flag would likely be even better.

As it is, the dentries are all freeable, but they only tend to get freed
at memory pressure rather than more proactively. But that's a separate
issue.

Link: https://lore.kernel.org/linux-trace-kernel/202401291043.e62e89dc-oliver.sang@intel.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240131185513.124644253@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: c1504e510238 ("eventfs: Implement eventfs dir creation functions")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# 49304c2b 31-Jan-2024 Linus Torvalds <torvalds@linux-foundation.org>

tracefs: dentry lookup crapectomy

The dentry lookup for eventfs files was very broken, and had lots of
signs of the old situation where the filesystem names were all created
statically in the dentry tree, rather than being looked up dynamically
based on the eventfs data structures.

You could see it in the naming - how it claimed to "create" dentries
rather than just look up the dentries that were given it.

You could see it in various nonsensical and very incorrect operations,
like using "simple_lookup()" on the dentries that were passed in, which
only results in those dentries becoming negative dentries. Which meant
that any other lookup would possibly return ENOENT if it saw that
negative dentry before the data was then later filled in.

You could see it in the immense amount of nonsensical code that didn't
actually just do lookups.

Link: https://lore.kernel.org/linux-trace-kernel/202401291043.e62e89dc-oliver.sang@intel.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240131233227.73db55e1@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: c1504e510238 ("eventfs: Implement eventfs dir creation functions")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# d81786f5 31-Jan-2024 Steven Rostedt (Google) <rostedt@goodmis.org>

tracefs: Zero out the tracefs_inode when allocating it

eventfs uses the tracefs_inode and assumes that it's already initialized
to zero. That is, it doesn't set fields to zero (like ti->private) after
getting its tracefs_inode. This causes bugs due to stale values.

Just initialize the entire structure to zero on allocation so there isn't
any more surprises.

This is a partial fix to access to ti->private. The assignment still needs
to be made before the dentry is instantiated.

Link: https://lore.kernel.org/linux-trace-kernel/20240131185512.315825944@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202401291043.e62e89dc-oliver.sang@intel.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# 8186fff7 03-Jan-2024 Steven Rostedt (Google) <rostedt@goodmis.org>

tracefs/eventfs: Use root and instance inodes as default ownership

Instead of walking the dentries on mount/remount to update the gid values of
all the dentries if a gid option is specified on mount, just update the root
inode. Add .getattr, .setattr, and .permissions on the tracefs inode
operations to update the permissions of the files and directories.

For all files and directories in the top level instance:

/sys/kernel/tracing/*

It will use the root inode as the default permissions. The inode that
represents: /sys/kernel/tracing (or wherever it is mounted).

When an instance is created:

mkdir /sys/kernel/tracing/instance/foo

The directory "foo" and all its files and directories underneath will use
the default of what foo is when it was created. A remount of tracefs will
not affect it.

If a user were to modify the permissions of any file or directory in
tracefs, it will also no longer be modified by a change in ownership of a
remount.

The events directory, if it is in the top level instance, will use the
tracefs root inode as the default ownership for itself and all the files and
directories below it.

For the events directory in an instance ("foo"), it will keep the ownership
of what it was when it was created, and that will be used as the default
ownership for the files and directories beneath it.

Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wjVdGkjDXBbvLn2wbZnqP4UsH46E3gqJ9m7UG6DpX2+WA@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240103215016.1e0c9811@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# da549bdd 07-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

dentry: switch the lists of children to hlist

Saves a pointer per struct dentry and actually makes the things less
clumsy. Cleaned the d_walk() and dcache_readdir() a bit by use
of hlist_for_... iterators.

A couple of new helpers - d_first_child() and d_next_sibling(),
to make the expressions less awful.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ad579864 02-Jan-2024 Steven Rostedt (Google) <rostedt@goodmis.org>

tracefs: Check for dentry->d_inode exists in set_gid()

If a getdents() is called on the tracefs directory but does not get all
the files, it can leave a "cursor" dentry in the d_subdirs list of tracefs
dentry. This cursor dentry does not have a d_inode for it. Before
referencing tracefs_inode from the dentry, the d_inode must first be
checked if it has content. If not, then it's not a tracefs_inode and can
be ignored.

The following caused a crash:

#define getdents64(fd, dirp, count) syscall(SYS_getdents64, fd, dirp, count)
#define BUF_SIZE 256
#define TDIR "/tmp/file0"

int main(void)
{
char buf[BUF_SIZE];
int fd;
int n;

mkdir(TDIR, 0777);
mount(NULL, TDIR, "tracefs", 0, NULL);
fd = openat(AT_FDCWD, TDIR, O_RDONLY);
n = getdents64(fd, buf, BUF_SIZE);
ret = mount(NULL, TDIR, NULL, MS_NOSUID|MS_REMOUNT|MS_RELATIME|MS_LAZYTIME,
"gid=1000");
return 0;
}

That's because the 256 BUF_SIZE was not big enough to read all the
dentries of the tracefs file system and it left a "cursor" dentry in the
subdirs of the tracefs root inode. Then on remounting with "gid=1000",
it would cause an iteration of all dentries which hit:

ti = get_tracefs(dentry->d_inode);
if (ti && (ti->flags & TRACEFS_EVENT_INODE))
eventfs_update_gid(dentry, gid);

Which crashed because of the dereference of the cursor dentry which had a NULL
d_inode.

In the subdir loop of the dentry lookup of set_gid(), if a child has a
NULL d_inode, simply skip it.

Link: https://lore.kernel.org/all/20240102135637.3a21fb10@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20240102151249.05da244d@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 7e8358edf503e ("eventfs: Fix file and directory uid and gid ownership")
Reported-by: "Ubisectech Sirius" <bugreport@ubisectech.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# 7e8358ed 21-Dec-2023 Steven Rostedt (Google) <rostedt@goodmis.org>

eventfs: Fix file and directory uid and gid ownership

It was reported that when mounting the tracefs file system with a gid
other than root, the ownership did not carry down to the eventfs directory
due to the dynamic nature of it.

A fix was done to solve this, but it had two issues.

(a) if the attr passed into update_inode_attr() was NULL, it didn't do
anything. This is true for files that have not had a chown or chgrp
done to itself or any of its sibling files, as the attr is allocated
for all children when any one needs it.

# umount /sys/kernel/tracing
# mount -o rw,seclabel,relatime,gid=1000 -t tracefs nodev /mnt

# ls -ld /mnt/events/sched
drwxr-xr-x 28 root rostedt 0 Dec 21 13:12 /mnt/events/sched/

# ls -ld /mnt/events/sched/sched_switch
drwxr-xr-x 2 root rostedt 0 Dec 21 13:12 /mnt/events/sched/sched_switch/

But when checking the files:

# ls -l /mnt/events/sched/sched_switch
total 0
-rw-r----- 1 root root 0 Dec 21 13:12 enable
-rw-r----- 1 root root 0 Dec 21 13:12 filter
-r--r----- 1 root root 0 Dec 21 13:12 format
-r--r----- 1 root root 0 Dec 21 13:12 hist
-r--r----- 1 root root 0 Dec 21 13:12 id
-rw-r----- 1 root root 0 Dec 21 13:12 trigger

(b) When the attr does not denote the UID or GID, it defaulted to using
the parent uid or gid. This is incorrect as changing the parent
uid or gid will automatically change all its children.

# chgrp tracing /mnt/events/timer

# ls -ld /mnt/events/timer
drwxr-xr-x 2 root tracing 0 Dec 21 14:34 /mnt/events/timer

# ls -l /mnt/events/timer
total 0
-rw-r----- 1 root root 0 Dec 21 14:35 enable
-rw-r----- 1 root root 0 Dec 21 14:35 filter
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_cancel
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_entry
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_exit
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_init
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_start
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_expire
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_state
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 tick_stop
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_cancel
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_entry
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_exit
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_init
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_start

At first it was thought that this could be easily fixed by just making the
default ownership of the superblock when it was mounted. But this does not
handle the case of:

# chgrp tracing instances
# mkdir instances/foo

If the superblock was used, then the group ownership would be that of what
it was when it was mounted, when it should instead be "tracing".

Instead, set a flag for the top level eventfs directory ("events") to flag
which eventfs_inode belongs to it.

Since the "events" directory's dentry and inode are never freed, it does
not need to use its attr field to restore its mode and ownership. Use the
this eventfs_inode's attr as the default ownership for all the files and
directories underneath it.

When the events eventfs_inode is created, it sets its ownership to its
parent uid and gid. As the events directory is created at boot up before
it gets mounted, this will always be uid=0 and gid=0. If it's created via
an instance, then it will take the ownership of the instance directory.

When the file system is mounted, it will update all the gids if one is
specified. This will have a callback to update the events evenfs_inode's
default entries.

When a file or directory is created under the events directory, it will
walk the ei->dentry parents until it finds the evenfs_inode that belongs
to the events directory to retrieve the default uid and gid values.

Link: https://lore.kernel.org/all/CAHk-=wiwQtUHvzwyZucDq8=Gtw+AnwScyLhpFswrQ84PjhoGsg@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231221190757.7eddbca9@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dongliang Cui <cuidongliang390@gmail.com>
Cc: Hongyu Jin <hongyu.jin@unisoc.com>
Fixes: 0dfc852b6fe3 ("eventfs: Have event files and directories default to parent uid and gid")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Tested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# fc456122 21-Nov-2023 Steven Rostedt (Google) <rostedt@goodmis.org>

eventfs: Do not allow NULL parent to eventfs_start_creating()

The eventfs directory is dynamically created via the meta data supplied by
the existing trace events. All files and directories in eventfs has a
parent. Do not allow NULL to be passed into eventfs_start_creating() as
the parent because that should never happen. Warn if it does.

Link: https://lkml.kernel.org/r/20231121231112.693841807@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# 5790b1fb 04-Oct-2023 Steven Rostedt (Google) <rostedt@goodmis.org>

eventfs: Remove eventfs_file and just use eventfs_inode

Instead of having a descriptor for every file represented in the eventfs
directory, only have the directory itself represented. Change the API to
send in a list of entries that represent all the files in the directory
(but not other directories). The entry list contains a name and a callback
function that will be used to create the files when they are accessed.

struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry *parent,
const struct eventfs_entry *entries,
int size, void *data);

is used for the top level eventfs directory, and returns an eventfs_inode
that will be used by:

struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode *parent,
const struct eventfs_entry *entries,
int size, void *data);

where both of the above take an array of struct eventfs_entry entries for
every file that is in the directory.

The entries are defined by:

typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
const struct file_operations **fops);

struct eventfs_entry {
const char *name;
eventfs_callback callback;
};

Where the name is the name of the file and the callback gets called when
the file is being created. The callback passes in the name (in case the
same callback is used for multiple files), a pointer to the mode, data and
fops. The data will be pointing to the data that was passed in
eventfs_create_dir() or eventfs_create_events_dir() but may be overridden
to point to something else, as it will be used to point to the
inode->i_private that is created. The information passed back from the
callback is used to create the dentry/inode.

If the callback fills the data and the file should be created, it must
return a positive number. On zero or negative, the file is ignored.

This logic may also be used as a prototype to convert entire pseudo file
systems into just-in-time allocation.

The "show_events_dentry" file has been updated to show the directories,
and any files they have.

With just the eventfs_file allocations:

Before after deltas for meminfo (in kB):

MemFree: -14360
MemAvailable: -14260
Buffers: 40
Cached: 24
Active: 44
Inactive: 48
Inactive(anon): 28
Active(file): 44
Inactive(file): 20
Dirty: -4
AnonPages: 28
Mapped: 4
KReclaimable: 132
Slab: 1604
SReclaimable: 132
SUnreclaim: 1472
Committed_AS: 12

Before after deltas for slabinfo:

<slab>: <objects> [ * <size> = <total>]

ext4_inode_cache 27 [* 1184 = 31968 ]
extent_status 102 [* 40 = 4080 ]
tracefs_inode_cache 144 [* 656 = 94464 ]
buffer_head 39 [* 104 = 4056 ]
shmem_inode_cache 49 [* 800 = 39200 ]
filp -53 [* 256 = -13568 ]
dentry 251 [* 192 = 48192 ]
lsm_file_cache 277 [* 32 = 8864 ]
vm_area_struct -14 [* 184 = -2576 ]
trace_event_file 1748 [* 88 = 153824 ]
kmalloc-1k 35 [* 1024 = 35840 ]
kmalloc-256 49 [* 256 = 12544 ]
kmalloc-192 -28 [* 192 = -5376 ]
kmalloc-128 -30 [* 128 = -3840 ]
kmalloc-96 10581 [* 96 = 1015776 ]
kmalloc-64 3056 [* 64 = 195584 ]
kmalloc-32 1291 [* 32 = 41312 ]
kmalloc-16 2310 [* 16 = 36960 ]
kmalloc-8 9216 [* 8 = 73728 ]

Free memory dropped by 14,360 kB
Available memory dropped by 14,260 kB
Total slab additions in size: 1,771,032 bytes

With this change:

Before after deltas for meminfo (in kB):

MemFree: -12084
MemAvailable: -11976
Buffers: 32
Cached: 32
Active: 72
Inactive: 168
Inactive(anon): 176
Active(file): 72
Inactive(file): -8
Dirty: 24
AnonPages: 196
Mapped: 8
KReclaimable: 148
Slab: 836
SReclaimable: 148
SUnreclaim: 688
Committed_AS: 324

Before after deltas for slabinfo:

<slab>: <objects> [ * <size> = <total>]

tracefs_inode_cache 144 [* 656 = 94464 ]
shmem_inode_cache -23 [* 800 = -18400 ]
filp -92 [* 256 = -23552 ]
dentry 179 [* 192 = 34368 ]
lsm_file_cache -3 [* 32 = -96 ]
vm_area_struct -13 [* 184 = -2392 ]
trace_event_file 1748 [* 88 = 153824 ]
kmalloc-1k -49 [* 1024 = -50176 ]
kmalloc-256 -27 [* 256 = -6912 ]
kmalloc-128 1864 [* 128 = 238592 ]
kmalloc-64 4685 [* 64 = 299840 ]
kmalloc-32 -72 [* 32 = -2304 ]
kmalloc-16 256 [* 16 = 4096 ]
total = 721352

Free memory dropped by 12,084 kB
Available memory dropped by 11,976 kB
Total slab additions in size: 721,352 bytes

That's over 2 MB in savings per instance for free and available memory,
and over 1 MB in savings per instance of slab memory.

Link: https://lore.kernel.org/linux-trace-kernel/20231003184059.4924468e@gandalf.local.home
Link: https://lore.kernel.org/linux-trace-kernel/20231004165007.43d79161@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ajay Kaher <akaher@vmware.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# 079cf91e 04-Oct-2023 Jeff Layton <jlayton@kernel.org>

tracefs: convert to new timestamp accessors

Convert to using the new inode timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185347.80880-70-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 9879e5e1 06-Sep-2023 Steven Rostedt (Google) <rostedt@goodmis.org>

tracefs/eventfs: Use dput to free the toplevel events directory

Currently when rmdir on an instance is done, eventfs_remove_events_dir()
is called and it does a dput on the dentry and then frees the
eventfs_inode that represents the events directory.

But there's no protection against a reader reading the top level events
directory at the same time and we can get a use after free error. Instead,
use the dput() associated to the dentry to also free the eventfs_inode
associated to the events directory, as that will get called when the last
reference to the directory is released.

This issue triggered the following KASAN report:

==================================================================
BUG: KASAN: slab-use-after-free in eventfs_root_lookup+0x88/0x1b0
Read of size 8 at addr ffff888120130ca0 by task ftracetest/1201

CPU: 4 PID: 1201 Comm: ftracetest Not tainted 6.5.0-test-10737-g469e0a8194e7 #13
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x57/0x90
print_report+0xcf/0x670
? __pfx_ring_buffer_record_off+0x10/0x10
? _raw_spin_lock_irqsave+0x2b/0x70
? __virt_addr_valid+0xd9/0x160
kasan_report+0xd4/0x110
? eventfs_root_lookup+0x88/0x1b0
? eventfs_root_lookup+0x88/0x1b0
eventfs_root_lookup+0x88/0x1b0
? eventfs_root_lookup+0x33/0x1b0
__lookup_slow+0x194/0x2a0
? __pfx___lookup_slow+0x10/0x10
? down_read+0x11c/0x330
walk_component+0x166/0x220
link_path_walk.part.0.constprop.0+0x3a3/0x5a0
? seqcount_lockdep_reader_access+0x82/0x90
? __pfx_link_path_walk.part.0.constprop.0+0x10/0x10
path_openat+0x143/0x11f0
? __lock_acquire+0xa1a/0x3220
? __pfx_path_openat+0x10/0x10
? __pfx___lock_acquire+0x10/0x10
do_filp_open+0x166/0x290
? __pfx_do_filp_open+0x10/0x10
? lock_is_held_type+0xce/0x120
? preempt_count_sub+0xb7/0x100
? _raw_spin_unlock+0x29/0x50
? alloc_fd+0x1a0/0x320
do_sys_openat2+0x126/0x160
? rcu_is_watching+0x34/0x60
? __pfx_do_sys_openat2+0x10/0x10
? __might_resched+0x2cf/0x3b0
? __fget_light+0xdf/0x100
__x64_sys_openat+0xcd/0x140
? __pfx___x64_sys_openat+0x10/0x10
? syscall_enter_from_user_mode+0x22/0x90
? lockdep_hardirqs_on+0x7d/0x100
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7f1dceef5e51
Code: 75 57 89 f0 25 00 00 41 00 3d 00 00 41 00 74 49 80 3d 9a 27 0e 00 00 74 6d 89 da 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 93 00 00 00 48 8b 54 24 28 64 48 2b 14 25
RSP: 002b:00007fff2cddf380 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000241 RCX: 00007f1dceef5e51
RDX: 0000000000000241 RSI: 000055d7520677d0 RDI: 00000000ffffff9c
RBP: 000055d7520677d0 R08: 000000000000001e R09: 0000000000000001
R10: 00000000000001b6 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000003 R14: 000055d752035678 R15: 000055d752067788
</TASK>

Allocated by task 1200:
kasan_save_stack+0x2f/0x50
kasan_set_track+0x21/0x30
__kasan_kmalloc+0x8b/0x90
eventfs_create_events_dir+0x54/0x220
create_event_toplevel_files+0x42/0x130
event_trace_add_tracer+0x33/0x180
trace_array_create_dir+0x52/0xf0
trace_array_create+0x361/0x410
instance_mkdir+0x6b/0xb0
tracefs_syscall_mkdir+0x57/0x80
vfs_mkdir+0x275/0x380
do_mkdirat+0x1da/0x210
__x64_sys_mkdir+0x74/0xa0
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Freed by task 1251:
kasan_save_stack+0x2f/0x50
kasan_set_track+0x21/0x30
kasan_save_free_info+0x27/0x40
__kasan_slab_free+0x106/0x180
__kmem_cache_free+0x149/0x2e0
event_trace_del_tracer+0xcb/0x120
__remove_instance+0x16a/0x340
instance_rmdir+0x77/0xa0
tracefs_syscall_rmdir+0x77/0xc0
vfs_rmdir+0xed/0x2d0
do_rmdir+0x235/0x280
__x64_sys_rmdir+0x5f/0x90
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x6e/0xd8

The buggy address belongs to the object at ffff888120130ca0
which belongs to the cache kmalloc-16 of size 16
The buggy address is located 0 bytes inside of
freed 16-byte region [ffff888120130ca0, ffff888120130cb0)

The buggy address belongs to the physical page:
page:000000004dbddbb0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x120130
flags: 0x17ffffc0000800(slab|node=0|zone=2|lastcpupid=0x1fffff)
page_type: 0xffffffff()
raw: 0017ffffc0000800 ffff8881000423c0 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000800080 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff888120130b80: 00 00 fc fc 00 05 fc fc 00 00 fc fc 00 02 fc fc
ffff888120130c00: 00 07 fc fc 00 00 fc fc 00 00 fc fc fa fb fc fc
>ffff888120130c80: 00 00 fc fc fa fb fc fc 00 00 fc fc 00 00 fc fc
^
ffff888120130d00: 00 00 fc fc 00 00 fc fc 00 00 fc fc fa fb fc fc
ffff888120130d80: 00 00 fc fc 00 00 fc fc 00 00 fc fc 00 00 fc fc
==================================================================

Link: https://lkml.kernel.org/r/20230907024803.250873643@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/

Cc: Ajay Kaher <akaher@vmware.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 5bdcd5f5331a2 eventfs: ("Implement removal of meta data from eventfs")
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# 51aab5ff 05-Sep-2023 Steven Rostedt (Google) <rostedt@goodmis.org>

tracefs: Add missing lockdown check to tracefs_create_dir()

The function tracefs_create_dir() was missing a lockdown check and was
called by the RV code. This gave an inconsistent behavior of this function
returning success while other tracefs functions failed. This caused the
inode being freed by the wrong kmem_cache.

Link: https://lkml.kernel.org/r/20230905182711.692687042@goodmis.org
Link: https://lore.kernel.org/all/202309050916.58201dc6-oliver.sang@intel.com/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ajay Kaher <akaher@vmware.com>
Cc: Ching-lin Yu <chinglinyu@google.com>
Fixes: bf8e602186ec4 ("tracing: Do not create tracefs files if tracefs lockdown is in effect")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# 08662977 17-Aug-2023 Sishuai Gong <sishuai.system@gmail.com>

tracefs: Avoid changing i_mode to a temp value

Right now inode->i_mode is updated twice to reach the desired value
in tracefs_apply_options(). Because there is no lock protecting the two
writes, other threads might read the intermediate value of inode->i_mode.

Thread-1 Thread-2
// tracefs_apply_options() //e.g., acl_permission_check
inode->i_mode &= ~S_IALLUGO;
unsigned int mode = inode->i_mode;
inode->i_mode |= opts->mode;

I think there is no need to introduce a lock but it is better to
only update inode->i_mode ONCE, so the readers will either see the old
or latest value, rather than an intermediate/temporary value.

Note, the race is not a security concern as the intermediate value is more
locked down than either the start or end version. This is more just to do
the conversion cleanly.

Link: https://lore.kernel.org/linux-trace-kernel/AB5B0A1C-75D9-4E82-A7F0-CF7D0715587B@gmail.com

Signed-off-by: Sishuai Gong <sishuai.system@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# 27152bce 28-Jul-2023 Ajay Kaher <akaher@vmware.com>

eventfs: Move tracing/events to eventfs

Up until now, /sys/kernel/tracing/events was no different than any other
part of tracefs. The files and directories within the events directory was
created when the tracefs was mounted, and also created for the instances in
/sys/kernel/tracing/instances/<instance>/events. Most of these files and
directories will never be referenced. Since there are thousands of these
files and directories they spend their time wasting precious memory
resources.

Move the "events" directory to the new eventfs. The eventfs will take the
meta data of the events that they represent and store that. When the files
in the events directory are referenced, the dentry and inodes to represent
them are then created. When the files are no longer referenced, they are
freed. This saves the precious memory resources that were wasted on these
seldom referenced dentries and inodes.

Running the following:

~# cat /proc/meminfo /proc/slabinfo > before.out
~# mkdir /sys/kernel/tracing/instances/foo
~# cat /proc/meminfo /proc/slabinfo > after.out

to test the changes produces the following deltas:

Before this change:

Before after deltas for meminfo:

MemFree: -32260
MemAvailable: -21496
KReclaimable: 21528
Slab: 22440
SReclaimable: 21528
SUnreclaim: 912
VmallocUsed: 16

Before after deltas for slabinfo:

<slab>: <objects> [ * <size> = <total>]

tracefs_inode_cache: 14472 [* 1184 = 17134848]
buffer_head: 24 [* 168 = 4032]
hmem_inode_cache: 28 [* 1480 = 41440]
dentry: 14450 [* 312 = 4508400]
lsm_inode_cache: 14453 [* 32 = 462496]
vma_lock: 11 [* 152 = 1672]
vm_area_struct: 2 [* 184 = 368]
trace_event_file: 1748 [* 88 = 153824]
kmalloc-256: 1072 [* 256 = 274432]
kmalloc-64: 2842 [* 64 = 181888]

Total slab additions in size: 22,763,400 bytes

With this change:

Before after deltas for meminfo:

MemFree: -12600
MemAvailable: -12580
Cached: 24
Active: 12
Inactive: 68
Inactive(anon): 48
Active(file): 12
Inactive(file): 20
Dirty: -4
AnonPages: 68
KReclaimable: 12
Slab: 1856
SReclaimable: 12
SUnreclaim: 1844
KernelStack: 16
PageTables: 36
VmallocUsed: 16

Before after deltas for slabinfo:

<slab>: <objects> [ * <size> = <total>]

tracefs_inode_cache: 108 [* 1184 = 127872]
buffer_head: 24 [* 168 = 4032]
hmem_inode_cache: 18 [* 1480 = 26640]
dentry: 127 [* 312 = 39624]
lsm_inode_cache: 152 [* 32 = 4864]
vma_lock: 67 [* 152 = 10184]
vm_area_struct: -12 [* 184 = -2208]
trace_event_file: 1764 [* 96 = 169344]
kmalloc-96: 14322 [* 96 = 1374912]
kmalloc-64: 2814 [* 64 = 180096]
kmalloc-32: 1103 [* 32 = 35296]
kmalloc-16: 2308 [* 16 = 36928]
kmalloc-8: 12800 [* 8 = 102400]

Total slab additions in size: 2,109,984 bytes

Which is a savings of 20,653,416 bytes (20 MB) per tracing instance.

Link: https://lkml.kernel.org/r/1690568452-46553-10-git-send-email-akaher@vmware.com

Signed-off-by: Ajay Kaher <akaher@vmware.com>
Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Ching-lin Yu <chinglinyu@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# a3760079 28-Jul-2023 Ajay Kaher <akaher@vmware.com>

eventfs: Implement functions to create files and dirs when accessed

Add create_file() and create_dir() functions to create the files and
directories respectively when they are accessed. The functions will be
called from the lookup operation of the inode_operations or from the open
function of file_operations.

Link: https://lkml.kernel.org/r/1690568452-46553-8-git-send-email-akaher@vmware.com

Signed-off-by: Ajay Kaher <akaher@vmware.com>
Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Ching-lin Yu <chinglinyu@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# 2c6b6b10 28-Jul-2023 Ajay Kaher <akaher@vmware.com>

tracefs: Rename and export some tracefs functions

Export a few tracefs functions that will be needed by the eventfs dynamic
file system. Rename them to start with "tracefs_" to keep with the name
space.

start_creating -> tracefs_start_creating
failed_creating -> tracefs_failed_creating
end_creating -> tracefs_end_creating

Link: https://lkml.kernel.org/r/1690568452-46553-4-git-send-email-akaher@vmware.com

Signed-off-by: Ajay Kaher <akaher@vmware.com>
Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Ching-lin Yu <chinglinyu@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# ba37ff75 28-Jul-2023 Ajay Kaher <akaher@vmware.com>

eventfs: Implement tracefs_inode_cache

Create a kmem cache of tracefs_inodes. To be more efficient, as there are
lots of tracefs inodes, create its own cache. This also allows to see how
many tracefs inodes have been created.

Add helper functions:
tracefs_alloc_inode()
tracefs_free_inode()
get_tracefs()

Link: https://lkml.kernel.org/r/1690568452-46553-3-git-send-email-akaher@vmware.com

Signed-off-by: Ajay Kaher <akaher@vmware.com>
Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Ching-lin Yu <chinglinyu@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# bb9c40e6 05-Jul-2023 Jeff Layton <jlayton@kernel.org>

tracefs: convert to ctime accessor functions

In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-75-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# c54bd91e 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->mkdir() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 47311db8 26-Aug-2022 Brian Norris <briannorris@chromium.org>

tracefs: Only clobber mode/uid/gid on remount if asked

Users may have explicitly configured their tracefs permissions; we
shouldn't overwrite those just because a second mount appeared.

Only clobber if the options were provided at mount time.

Note: the previous behavior was especially surprising in the presence of
automounted /sys/kernel/debug/tracing/.

Existing behavior:

## Pre-existing status: tracefs is 0755.
# stat -c '%A' /sys/kernel/tracing/
drwxr-xr-x

## (Re)trigger the automount.
# umount /sys/kernel/debug/tracing
# stat -c '%A' /sys/kernel/debug/tracing/.
drwx------

## Unexpected: the automount changed mode for other mount instances.
# stat -c '%A' /sys/kernel/tracing/
drwx------

New behavior (after this change):

## Pre-existing status: tracefs is 0755.
# stat -c '%A' /sys/kernel/tracing/
drwxr-xr-x

## (Re)trigger the automount.
# umount /sys/kernel/debug/tracing
# stat -c '%A' /sys/kernel/debug/tracing/.
drwxr-xr-x

## Expected: the automount does not change other mount instances.
# stat -c '%A' /sys/kernel/tracing/
drwxr-xr-x

Link: https://lkml.kernel.org/r/20220826174353.2.Iab6e5ea57963d6deca5311b27fb7226790d44406@changeid

Cc: stable@vger.kernel.org
Fixes: 4282d60689d4f ("tracefs: Add new tracefs file system")
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# 93a8c044 05-Jun-2022 Xiang wangx <wangxiang@cdjrlc.com>

tracefs: Fix syntax errors in comments

Delete the redundant word 'to'.

Link: https://lkml.kernel.org/r/20220605092729.13010-1-wangxiang@cdjrlc.com

Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# 851e99eb 25-Feb-2022 Steven Rostedt (Google) <rostedt@goodmis.org>

tracefs: Set the group ownership in apply_options() not parse_options()

Al Viro brought it to my attention that the dentries may not be filled
when the parse_options() is called, causing the call to set_gid() to
possibly crash. It should only be called if parse_options() succeeds
totally anyway.

He suggested the logical place to do the update is in apply_options().

Link: https://lore.kernel.org/all/20220225165219.737025658@goodmis.org/
Link: https://lkml.kernel.org/r/20220225153426.1c4cab6b@gandalf.local.home

Cc: stable@vger.kernel.org
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 48b27b6b5191 ("tracefs: Set all files to the same group ownership as the mount option")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# ee34c52c 08-Dec-2021 Steven Rostedt (VMware) <rostedt@goodmis.org>

tracefs: Use d_inode() helper function to get the dentry inode

Instead of referencing the inode from a dentry via dentry->d_inode, use
the helper function d_inode(dentry) instead. This is the considered the
correct way to access it.

Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Reported: https://lore.kernel.org/all/20211208104454.nhxyvmmn6d2qhpwl@wittgenstein/
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 48b27b6b 07-Dec-2021 Steven Rostedt (VMware) <rostedt@goodmis.org>

tracefs: Set all files to the same group ownership as the mount option

As people have been asking to allow non-root processes to have access to
the tracefs directory, it was considered best to only allow groups to have
access to the directory, where it is easier to just set the tracefs file
system to a specific group (as other would be too dangerous), and that way
the admins could pick which processes would have access to tracefs.

Unfortunately, this broke tooling on Android that expected the other bit
to be set. For some special cases, for non-root tools to trace the system,
tracefs would be mounted and change the permissions of the top level
directory which gave access to all running tasks permission to the
tracing directory. Even though this would be dangerous to do in a
production environment, for testing environments this can be useful.

Now with the new changes to not allow other (which is still the proper
thing to do), it breaks the testing tooling. Now more code needs to be
loaded on the system to change ownership of the tracing directory.

The real solution is to have tracefs honor the gid=xxx option when
mounting. That is,

(tracing group tracing has value 1003)

mount -t tracefs -o gid=1003 tracefs /sys/kernel/tracing

should have it that all files in the tracing directory should be of the
given group.

Copy the logic from d_walk() from dcache.c and simplify it for the mount
case of tracefs if gid is set. All the files in tracefs will be walked and
their group will be set to the value passed in.

Link: https://lkml.kernel.org/r/20211207171729.2a54e1b3@gandalf.local.home

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reported-by: Kalesh Singh <kaleshsingh@google.com>
Reported-by: Yabin Cui <yabinc@google.com>
Fixes: 49d67e445742 ("tracefs: Have tracefs directories not set OTH permission bits by default")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# ee7f3666 08-Dec-2021 Steven Rostedt (VMware) <rostedt@goodmis.org>

tracefs: Have new files inherit the ownership of their parent

If directories in tracefs have their ownership changed, then any new files
and directories that are created under those directories should inherit
the ownership of the director they are created in.

Link: https://lkml.kernel.org/r/20211208075720.4855d180@gandalf.local.home

Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Yabin Cui <yabinc@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: stable@vger.kernel.org
Fixes: 4282d60689d4f ("tracefs: Add new tracefs file system")
Reported-by: Kalesh Singh <kaleshsingh@google.com>
Reported: https://lore.kernel.org/all/CAC_TJve8MMAv+H_NdLSJXZUSoxOEq2zB_pVaJ9p=7H6Bu3X76g@mail.gmail.com/
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 49d67e44 18-Aug-2021 Steven Rostedt (VMware) <rostedt@goodmis.org>

tracefs: Have tracefs directories not set OTH permission bits by default

The tracefs file system is by default mounted such that only root user can
access it. But there are legitimate reasons to create a group and allow
those added to the group to have access to tracing. By changing the
permissions of the tracefs mount point to allow access, it will allow
group access to the tracefs directory.

There should not be any real reason to allow all access to the tracefs
directory as it contains sensitive information. Have the default
permission of directories being created not have any OTH (other) bits set,
such that an admin that wants to give permission to a group has to first
disable all OTH bits in the file system.

Link: https://lkml.kernel.org/r/20210818153038.664127804@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# f2cc020d 23-Mar-2021 Ingo Molnar <mingo@kernel.org>

tracing: Fix various typos in comments

Fix ~59 single-word typos in the tracing code comments, and fix
the grammar in a handful of places.

Link: https://lore.kernel.org/r/20210322224546.GA1981273@gmail.com
Link: https://lkml.kernel.org/r/20210323174935.GA4176821@gmail.com

Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 549c7297 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: make helpers idmap mount aware

Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# a3d1e7eb 18-Nov-2019 Al Viro <viro@zeniv.linux.org.uk>

simple_recursive_removal(): kernel-side rm -rf for ramfs-style filesystems

two requirements: no file creations in IS_DEADDIR and no cross-directory
renames whatsoever.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# bf8e6021 11-Oct-2019 Steven Rostedt (VMware) <rostedt@goodmis.org>

tracing: Do not create tracefs files if tracefs lockdown is in effect

If on boot up, lockdown is activated for tracefs, don't even bother creating
the files. This can also prevent instances from being created if lockdown is
in effect.

Link: http://lkml.kernel.org/r/CAHk-=whC6Ji=fWnjh2+eS4b15TnbsS4VPVtvBOwCy1jjEG_JHQ@mail.gmail.com

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 3ed270b1 11-Oct-2019 Steven Rostedt (VMware) <rostedt@goodmis.org>

tracefs: Revert ccbd54ff54e8 ("tracefs: Restrict tracefs when the kernel is locked down")

Running the latest kernel through my "make instances" stress tests, I
triggered the following bug (with KASAN and kmemleak enabled):

mkdir invoked oom-killer:
gfp_mask=0x40cd0(GFP_KERNEL|__GFP_COMP|__GFP_RECLAIMABLE), order=0,
oom_score_adj=0
CPU: 1 PID: 2229 Comm: mkdir Not tainted 5.4.0-rc2-test #325
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
Call Trace:
dump_stack+0x64/0x8c
dump_header+0x43/0x3b7
? trace_hardirqs_on+0x48/0x4a
oom_kill_process+0x68/0x2d5
out_of_memory+0x2aa/0x2d0
__alloc_pages_nodemask+0x96d/0xb67
__alloc_pages_node+0x19/0x1e
alloc_slab_page+0x17/0x45
new_slab+0xd0/0x234
___slab_alloc.constprop.86+0x18f/0x336
? alloc_inode+0x2c/0x74
? irq_trace+0x12/0x1e
? tracer_hardirqs_off+0x1d/0xd7
? __slab_alloc.constprop.85+0x21/0x53
__slab_alloc.constprop.85+0x31/0x53
? __slab_alloc.constprop.85+0x31/0x53
? alloc_inode+0x2c/0x74
kmem_cache_alloc+0x50/0x179
? alloc_inode+0x2c/0x74
alloc_inode+0x2c/0x74
new_inode_pseudo+0xf/0x48
new_inode+0x15/0x25
tracefs_get_inode+0x23/0x7c
? lookup_one_len+0x54/0x6c
tracefs_create_file+0x53/0x11d
trace_create_file+0x15/0x33
event_create_dir+0x2a3/0x34b
__trace_add_new_event+0x1c/0x26
event_trace_add_tracer+0x56/0x86
trace_array_create+0x13e/0x1e1
instance_mkdir+0x8/0x17
tracefs_syscall_mkdir+0x39/0x50
? get_dname+0x31/0x31
vfs_mkdir+0x78/0xa3
do_mkdirat+0x71/0xb0
sys_mkdir+0x19/0x1b
do_fast_syscall_32+0xb0/0xed

I bisected this down to the addition of the proxy_ops into tracefs for
lockdown. It appears that the allocation of the proxy_ops and then freeing
it in the destroy_inode callback, is causing havoc with the memory system.
Reading the documentation about destroy_inode and talking with Linus about
this, this is buggy and wrong. When defining the destroy_inode() method, it
is expected that the destroy_inode() will also free the inode, and not just
the extra allocations done in the creation of the inode. The faulty commit
causes a memory leak of the inode data structure when they are deleted.

Instead of allocating the proxy_ops (and then having to free it) the checks
should be done by the open functions themselves, and not hack into the
tracefs directory. First revert the tracefs updates for locked_down and then
later we can add the locked_down checks in the kernel/trace files.

Link: http://lkml.kernel.org/r/20191011135458.7399da44@gandalf.local.home

Fixes: ccbd54ff54e8 ("tracefs: Restrict tracefs when the kernel is locked down")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# ccbd54ff 19-Aug-2019 Matthew Garrett <matthewgarrett@google.com>

tracefs: Restrict tracefs when the kernel is locked down

Tracefs may release more information about the kernel than desirable, so
restrict it when the kernel is locked down in confidentiality mode by
preventing open().

(Fixed by Ben Hutchings to avoid a null dereference in
default_file_open())

Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: James Morris <jmorris@namei.org>


# 4bf23774 26-May-2019 Amir Goldstein <amir73il@gmail.com>

tracefs: call fsnotify_{unlink,rmdir}() hooks

This will allow generating fsnotify delete events after the
fsnotify_nameremove() hook is removed from d_delete().

Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>


# d2912cb1 04-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500

Based on 2 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 5248ee85 25-Jul-2018 Zubin Mithra <zsm@chromium.org>

tracefs: Annotate tracefs_ops with __ro_after_init

tracefs_ops is initialized inside tracefs_create_instance_dir and not
modified after. tracefs_create_instance_dir allows for initialization
only once, and is called from create_trace_instances(marked __init),
which is called from tracer_init_tracefs(marked __init). Also, mark
tracefs_create_instance_dir as __init.

Link: http://lkml.kernel.org/r/20180725171901.4468-1-zsm@chromium.org

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Zubin Mithra <zsm@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# c3d98ea0 05-Jul-2017 David Howells <dhowells@redhat.com>

VFS: Don't use save/replace_mount_options if not using generic_show_options

btrfs, debugfs, reiserfs and tracefs call save_mount_options() and reiserfs
calls replace_mount_options(), but they then implement their own
->show_options() methods and don't touch s_options, rendering the saved
options unnecessary. I'm trying to eliminate s_options to make it easier
to implement a context-based mount where the mount options can be passed
individually over a file descriptor.

Remove the calls to save/replace_mount_options() call in these cases.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Chris Mason <clm@fb.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Steven Rostedt <rostedt@goodmis.org>
cc: linux-btrfs@vger.kernel.org
cc: reiserfs-devel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# cda37124 25-Mar-2017 Eric Biggers <ebiggers@google.com>

fs: constify tree_descr arrays passed to simple_fill_super()

simple_fill_super() is passed an array of tree_descr structures which
describe the files to create in the filesystem's root directory. Since
these arrays are never modified intentionally, they should be 'const' so
that they are placed in .rodata and benefit from memory protection.
This patch updates the function signature and all users, and also
constifies tree_descr.name.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 078cd827 14-Sep-2016 Deepa Dinamani <deepa.kernel@gmail.com>

fs: Replace CURRENT_TIME with current_time() for inode timestamps

CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.

CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.

Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0ba3353c 29-May-2016 Al Viro <viro@zeniv.linux.org.uk>

tracefs: ->d_parent is never NULL or negative...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5955102c 22-Jan-2016 Al Viro <viro@zeniv.linux.org.uk>

wrappers for ->i_mutex access

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d227c3ae 04-Nov-2015 Daniel Borkmann <daniel@iogearbox.net>

tracefs: Fix refcount imbalance in start_creating()

In tracefs' start_creating(), we pin the file system to safely access
its root. When we failed to create a file, we unpin the file system via
failed_creating() to release the mount count and eventually the reference
of the singleton vfsmount.

However, when we run into an error during lookup_one_len() when still
in start_creating(), we only release the parent's mutex but not so the
reference on the mount.

F.e., in securityfs_create_file(), after doing simple_pin_fs() when
lookup_one_len() fails there, we infact do simple_release_fs(). This
seems necessary here as well.

Same issue seen in debugfs due to 190afd81e4a5 ("debugfs: split the
beginning and the end of __create_file() off"), which seemed to got
carried over into tracefs, too. Noticed during code review.

Link: http://lkml.kernel.org/r/68efa86101b778cf7517ed7c6ad573bd69f60ec6.1446672850.git.daniel@iogearbox.net

Fixes: 4282d60689d4 ("tracefs: Add new tracefs file system")
Cc: stable@vger.kernel.org # 4.1+
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# f9bb4882 13-May-2015 Eric W. Biederman <ebiederm@xmission.com>

sysfs: Create mountpoints with sysfs_create_mount_point

This allows for better documentation in the code and
it allows for a simpler and fully correct version of
fs_fully_visible to be written.

The mount points converted and their filesystems are:
/sys/hypervisor/s390/ s390_hypfs
/sys/kernel/config/ configfs
/sys/kernel/debug/ debugfs
/sys/firmware/efi/efivars/ efivarfs
/sys/fs/fuse/connections/ fusectl
/sys/fs/pstore/ pstore
/sys/kernel/tracing/ tracefs
/sys/fs/cgroup/ cgroup
/sys/kernel/security/ securityfs
/sys/fs/selinux/ selinuxfs
/sys/fs/smackfs/ smackfs

Cc: stable@vger.kernel.org
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# dc3f4198 18-May-2015 Al Viro <viro@zeniv.linux.org.uk>

make simple_positive() public

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# eae47358 21-Jan-2015 Steven Rostedt (Red Hat) <rostedt@goodmis.org>

tracing: Have mkdir and rmdir be part of tracefs

The tracing "instances" directory can create sub tracing buffers
with mkdir, and remove them with rmdir. As a mkdir will also create
all the files and directories that control the sub buffer the inode
mutexes need to be released before this is done, to avoid deadlocks.
It is better to let the tracing system unlock the inode mutexes before
calling the functions that create the files within the new directory
(or deletes the files from the one being destroyed).

Now that tracing has been converted over to tracefs, the tracefs file
system can be modified to accommodate this feature. It still releases
the locks, but the filesystem itself can take care of the ugly
business and let the user just do what it needs.

The tracing system now attaches a descriptor to the directory dentry
that can have userspace create or remove sub directories. If this
descriptor does not exist for a dentry, then that dentry can not be
used to create other directories. This descriptor holds a mkdir and
rmdir method that only takes a character string as an argument.

The tracefs file system will first make a copy of the dentry name
before releasing the locks. Then it will pass the copied name to the
methods. It is up to the tracing system that supplied the methods to
handle races with duplicate names and such as all the inode mutexes
would be released when the functions are called.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# cc31004a 21-Jan-2015 Steven Rostedt (Red Hat) <rostedt@goodmis.org>

tracefs: Add directory /sys/kernel/tracing

When tracefs is configured, have the directory /sys/kernel/tracing appear
just like /sys/kernel/debug appears when debugfs is configured.

This will give a consistent place for system admins to mount tracefs.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 4282d606 20-Jan-2015 Steven Rostedt (Red Hat) <rostedt@goodmis.org>

tracefs: Add new tracefs file system

Add a separate file system to handle the tracing directory. Currently it
is part of debugfs, but that is starting to show its limits.

One thing is that in order to access the tracing infrastructure, you need
to mount debugfs. As that includes debugging from all sorts of sub systems
in the kernel, it is not considered advisable to mount such an all
encompassing debugging system.

Having the tracing system in its own file systems gives access to the
tracing sub system without needing to include all other systems.

Another problem with tracing using the debugfs system is that the
instances use mkdir to create sub buffers. debugfs does not support mkdir
from userspace so to implement it, special hacks were used. By controlling
the file system that the tracing infrastructure uses, this can be properly
done without hacks.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>