Searched +hist:5 +hist:e4a0847 (Results 1 - 5 of 5) sorted by relevance

/linux-master/kernel/
H A Dutsname.cdiff 5b825c3a Thu Feb 02 09:54:15 MST 2017 Ingo Molnar <mingo@kernel.org> sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h>

Add #include <linux/cred.h> dependencies to all .c files rely on sched.h
doing that for them.

Note that even if the count where we need to add extra headers seems high,
it's still a net win, because <linux/sched.h> is included in over
2,200 files ...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
diff 5e4a0847 Fri Dec 14 08:55:36 MST 2012 Eric W. Biederman <ebiederm@xmission.com> userns: Require CAP_SYS_ADMIN for most uses of setns.

Andy Lutomirski <luto@amacapital.net> found a nasty little bug in
the permissions of setns. With unprivileged user namespaces it
became possible to create new namespaces without privilege.

However the setns calls were relaxed to only require CAP_SYS_ADMIN in
the user nameapce of the targed namespace.

Which made the following nasty sequence possible.

pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
if (pid == 0) { /* child */
system("mount --bind /home/me/passwd /etc/passwd");
}
else if (pid != 0) { /* parent */
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
fd = open(path, O_RDONLY);
setns(fd, 0);
system("su -");
}

Prevent this possibility by requiring CAP_SYS_ADMIN
in the current user namespace when joing all but the user namespace.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
diff 5e4a0847 Fri Dec 14 08:55:36 MST 2012 Eric W. Biederman <ebiederm@xmission.com> userns: Require CAP_SYS_ADMIN for most uses of setns.

Andy Lutomirski <luto@amacapital.net> found a nasty little bug in
the permissions of setns. With unprivileged user namespaces it
became possible to create new namespaces without privilege.

However the setns calls were relaxed to only require CAP_SYS_ADMIN in
the user nameapce of the targed namespace.

Which made the following nasty sequence possible.

pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
if (pid == 0) { /* child */
system("mount --bind /home/me/passwd /etc/passwd");
}
else if (pid != 0) { /* parent */
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
fd = open(path, O_RDONLY);
setns(fd, 0);
system("su -");
}

Prevent this possibility by requiring CAP_SYS_ADMIN
in the current user namespace when joing all but the user namespace.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
H A Dpid_namespace.cdiff 9876cfe8 Mon Aug 14 02:41:00 MDT 2023 Aleksa Sarai <cyphar@cyphar.com> memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy

This sysctl has the very unusual behaviour of not allowing any user (even
CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were
to set this sysctl to a more restrictive option in the host pidns you
would need to reboot your machine in order to reset it.

The justification given in [1] is that this is a security feature and thus
it should not be possible to disable. Aside from the fact that we have
plenty of security-related sysctls that can be disabled after being
enabled (fs.protected_symlinks for instance), the protection provided by
the sysctl is to stop users from being able to create a binary and then
execute it. A user with CAP_SYS_ADMIN can trivially do this without
memfd_create(2):

% cat mount-memfd.c
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/mount.h>

#define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"

int main(void)
{
int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
assert(fsfd >= 0);
assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2));

int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
assert(dfd >= 0);

int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0782);
assert(execfd >= 0);
assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
assert(!close(execfd));

char *execpath = NULL;
char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
assert(execfd >= 0);
assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
assert(!execve(execpath, argv, envp));
}
% ./mount-memfd
this file was executed from this totally private tmpfs: /proc/self/fd/5
%

Given that it is possible for CAP_SYS_ADMIN users to create executable
binaries without memfd_create(2) and without touching the host filesystem
(not to mention the many other things a CAP_SYS_ADMIN process would be
able to do that would be equivalent or worse), it seems strange to cause a
fair amount of headache to admins when there doesn't appear to be an
actual security benefit to blocking this. There appear to be concerns
about confused-deputy-esque attacks[2] but a confused deputy that can
write to arbitrary sysctls is a bigger security issue than executable
memfds.

/* New API */

The primary requirement from the original author appears to be more based
on the need to be able to restrict an entire system in a hierarchical
manner[3], such that child namespaces cannot re-enable executable memfds.

So, implement that behaviour explicitly -- the vm.memfd_noexec scope is
evaluated up the pidns tree to &init_pid_ns and you have the most
restrictive value applied to you. The new lower limit you can set
vm.memfd_noexec is whatever limit applies to your parent.

Note that a pidns will inherit a copy of the parent pidns's effective
vm.memfd_noexec setting at unshare() time. This matches the existing
behaviour, and it also ensures that a pidns will never have its
vm.memfd_noexec setting *lowered* behind its back (but it will be raised
if the parent raises theirs).

/* Backwards Compatibility */

As the previous version of the sysctl didn't allow you to lower the
setting at all, there are no backwards compatibility issues with this
aspect of the change.

However it should be noted that now that the setting is completely
hierarchical. Previously, a cloned pidns would just copy the current
pidns setting, meaning that if the parent's vm.memfd_noexec was changed it
wouldn't propoagate to existing pid namespaces. Now, the restriction
applies recursively. This is a uAPI change, however:

* The sysctl is very new, having been merged in 6.3.
* Several aspects of the sysctl were broken up until this patchset and
the other patchset by Jeff Xu last month.

And thus it seems incredibly unlikely that any real users would run into
this issue. In the worst case, if this causes userspace isues we could
make it so that modifying the setting follows the hierarchical rules but
the restriction checking uses the cached copy.

[1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/
[2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/
[3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/

Link: https://lkml.kernel.org/r/20230814-memfd-vm-noexec-uapi-fixes-v2-4-7ff9e3e10ba6@cyphar.com
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Verkamp <dverkamp@chromium.org>
Cc: Jeff Xu <jeffxu@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 28319d6d Fri Nov 25 06:55:00 MST 2022 Frederic Weisbecker <frederic@kernel.org> rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()

RCU Tasks and PID-namespace unshare can interact in do_exit() in a
complicated circular dependency:

1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace
that every subsequent child of TASK A will belong to. But TASK A
doesn't itself belong to that new PID namespace.

2) TASK A forks() and creates TASK B. TASK A stays attached to its PID
namespace (let's say PID_NS1) and TASK B is the first task belonging
to the new PID namespace created by unshare() (let's call it PID_NS2).

3) Since TASK B is the first task attached to PID_NS2, it becomes the
PID_NS2 child reaper.

4) TASK A forks() again and creates TASK C which get attached to PID_NS2.
Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has
TASK B (belonging to PID_NS2) as a pid_namespace child_reaper.

5) TASK B exits and since it is the child reaper for PID_NS2, it has to
kill all other tasks attached to PID_NS2, and wait for all of them to
die before getting reaped itself (zap_pid_ns_process()).

6) TASK A calls synchronize_rcu_tasks() which leads to
synchronize_srcu(&tasks_rcu_exit_srcu).

7) TASK B is waiting for TASK C to get reaped. But TASK B is under a
tasks_rcu_exit_srcu SRCU critical section (exit_notify() is between
exit_tasks_rcu_start() and exit_tasks_rcu_finish()), blocking TASK A.

8) TASK C exits and since TASK A is its parent, it waits for it to reap
TASK C, but it can't because TASK A waits for TASK B that waits for
TASK C.

Pid_namespace semantics can hardly be changed at this point. But the
coverage of tasks_rcu_exit_srcu can be reduced instead.

The current task is assumed not to be concurrently reapable at this
stage of exit_notify() and therefore tasks_rcu_exit_srcu can be
temporarily relaxed without breaking its constraints, providing a way
out of the deadlock scenario.

[ paulmck: Fix build failure by adding additional declaration. ]

Fixes: 3f95aa81d265 ("rcu: Make TASKS_RCU handle tasks that are almost done exiting")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Suggested-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Eric W . Biederman <ebiederm@xmission.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff fab827db Thu Sep 02 15:54:57 MDT 2021 Vasily Averin <vvs@virtuozzo.com> memcg: enable accounting for pids in nested pid namespaces

Commit 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg")
enabled memcg accounting for pids allocated from init_pid_ns.pid_cachep,
but forgot to adjust the setting for nested pid namespaces. As a result,
pid memory is not accounted exactly where it is really needed, inside
memcg-limited containers with their own pid namespaces.

Pid was one the first kernel objects enabled for memcg accounting.
init_pid_ns.pid_cachep marked by SLAB_ACCOUNT and we can expect that any
new pids in the system are memcg-accounted.

Though recently I've noticed that it is wrong. nested pid namespaces
creates own slab caches for pid objects, nested pids have increased size
because contain id both for all parent and for own pid namespaces. The
problem is that these slab caches are _NOT_ marked by SLAB_ACCOUNT, as a
result any pids allocated in nested pid namespaces are not
memcg-accounted.

Pid struct in nested pid namespace consumes up to 500 bytes memory, 100000
such objects gives us up to ~50Mb unaccounted memory, this allow container
to exceed assigned memcg limits.

Link: https://lkml.kernel.org/r/8b6de616-fd1a-02c6-cbdb-976ecdcfa604@virtuozzo.com
Fixes: 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg")
Cc: stable@vger.kernel.org
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff fab827db Thu Sep 02 15:54:57 MDT 2021 Vasily Averin <vvs@virtuozzo.com> memcg: enable accounting for pids in nested pid namespaces

Commit 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg")
enabled memcg accounting for pids allocated from init_pid_ns.pid_cachep,
but forgot to adjust the setting for nested pid namespaces. As a result,
pid memory is not accounted exactly where it is really needed, inside
memcg-limited containers with their own pid namespaces.

Pid was one the first kernel objects enabled for memcg accounting.
init_pid_ns.pid_cachep marked by SLAB_ACCOUNT and we can expect that any
new pids in the system are memcg-accounted.

Though recently I've noticed that it is wrong. nested pid namespaces
creates own slab caches for pid objects, nested pids have increased size
because contain id both for all parent and for own pid namespaces. The
problem is that these slab caches are _NOT_ marked by SLAB_ACCOUNT, as a
result any pids allocated in nested pid namespaces are not
memcg-accounted.

Pid struct in nested pid namespace consumes up to 500 bytes memory, 100000
such objects gives us up to ~50Mb unaccounted memory, this allow container
to exceed assigned memcg limits.

Link: https://lkml.kernel.org/r/8b6de616-fd1a-02c6-cbdb-976ecdcfa604@virtuozzo.com
Fixes: 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg")
Cc: stable@vger.kernel.org
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 5b825c3a Thu Feb 02 09:54:15 MST 2017 Ingo Molnar <mingo@kernel.org> sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h>

Add #include <linux/cred.h> dependencies to all .c files rely on sched.h
doing that for them.

Note that even if the count where we need to add extra headers seems high,
it's still a net win, because <linux/sched.h> is included in over
2,200 files ...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
diff 5cc54451 Tue Apr 30 16:28:27 MDT 2013 Raphael S.Carvalho <raphael.scarv@gmail.com> pid_namespace.c/.h: simplify defines

Move BITS_PER_PAGE from pid_namespace.c to pid_namespace.h, since we can
simplify the define PID_MAP_ENTRIES by using the BITS_PER_PAGE.

[akpm@linux-foundation.org: kernel/pid.c:54:1: warning: "BITS_PER_PAGE" redefined]
Signed-off-by: Raphael S.Carvalho <raphael.scarv@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 5e4a0847 Fri Dec 14 08:55:36 MST 2012 Eric W. Biederman <ebiederm@xmission.com> userns: Require CAP_SYS_ADMIN for most uses of setns.

Andy Lutomirski <luto@amacapital.net> found a nasty little bug in
the permissions of setns. With unprivileged user namespaces it
became possible to create new namespaces without privilege.

However the setns calls were relaxed to only require CAP_SYS_ADMIN in
the user nameapce of the targed namespace.

Which made the following nasty sequence possible.

pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
if (pid == 0) { /* child */
system("mount --bind /home/me/passwd /etc/passwd");
}
else if (pid != 0) { /* parent */
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
fd = open(path, O_RDONLY);
setns(fd, 0);
system("su -");
}

Prevent this possibility by requiring CAP_SYS_ADMIN
in the current user namespace when joing all but the user namespace.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
diff 5e4a0847 Fri Dec 14 08:55:36 MST 2012 Eric W. Biederman <ebiederm@xmission.com> userns: Require CAP_SYS_ADMIN for most uses of setns.

Andy Lutomirski <luto@amacapital.net> found a nasty little bug in
the permissions of setns. With unprivileged user namespaces it
became possible to create new namespaces without privilege.

However the setns calls were relaxed to only require CAP_SYS_ADMIN in
the user nameapce of the targed namespace.

Which made the following nasty sequence possible.

pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
if (pid == 0) { /* child */
system("mount --bind /home/me/passwd /etc/passwd");
}
else if (pid != 0) { /* parent */
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
fd = open(path, O_RDONLY);
setns(fd, 0);
system("su -");
}

Prevent this possibility by requiring CAP_SYS_ADMIN
in the current user namespace when joing all but the user namespace.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
diff 5a0e3ad6 Wed Mar 24 02:04:11 MDT 2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
diff 5a0e3ad6 Wed Mar 24 02:04:11 MDT 2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
/linux-master/ipc/
H A Dnamespace.cdiff b2441318 Wed Nov 01 08:07:57 MDT 2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org> License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff b2441318 Wed Nov 01 08:07:57 MDT 2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org> License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff b2441318 Wed Nov 01 08:07:57 MDT 2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org> License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 0cfb6aee Fri Sep 08 17:17:55 MDT 2017 Guillaume Knispel <guillaume.knispel@supersonicimagine.com> ipc: optimize semget/shmget/msgget for lots of keys

ipc_findkey() used to scan all objects to look for the wanted key. This
is slow when using a high number of keys. This change adds an rhashtable
of kern_ipc_perm objects in ipc_ids, so that one lookup cease to be O(n).

This change gives a 865% improvement of benchmark reaim.jobs_per_min on a
56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory [1]

Other (more micro) benchmark results, by the author: On an i5 laptop, the
following loop executed right after a reboot took, without and with this
change:

for (int i = 0, k=0x424242; i < KEYS; ++i)
semget(k++, 1, IPC_CREAT | 0600);

total total max single max single
KEYS without with call without call with

1 3.5 4.9 µs 3.5 4.9
10 7.6 8.6 µs 3.7 4.7
32 16.2 15.9 µs 4.3 5.3
100 72.9 41.8 µs 3.7 4.7
1000 5,630.0 502.0 µs * *
10000 1,340,000.0 7,240.0 µs * *
31900 17,600,000.0 22,200.0 µs * *

*: unreliable measure: high variance

The duration for a lookup-only usage was obtained by the same loop once
the keys are present:

total total max single max single
KEYS without with call without call with

1 2.1 2.5 µs 2.1 2.5
10 4.5 4.8 µs 2.2 2.3
32 13.0 10.8 µs 2.3 2.8
100 82.9 25.1 µs * 2.3
1000 5,780.0 217.0 µs * *
10000 1,470,000.0 2,520.0 µs * *
31900 17,400,000.0 7,810.0 µs * *

Finally, executing each semget() in a new process gave, when still
summing only the durations of these syscalls:

creation:
total total
KEYS without with

1 3.7 5.0 µs
10 32.9 36.7 µs
32 125.0 109.0 µs
100 523.0 353.0 µs
1000 20,300.0 3,280.0 µs
10000 2,470,000.0 46,700.0 µs
31900 27,800,000.0 219,000.0 µs

lookup-only:
total total
KEYS without with

1 2.5 2.7 µs
10 25.4 24.4 µs
32 106.0 72.6 µs
100 591.0 352.0 µs
1000 22,400.0 2,250.0 µs
10000 2,510,000.0 25,700.0 µs
31900 28,200,000.0 115,000.0 µs

[1] http://lkml.kernel.org/r/20170814060507.GE23258@yexl-desktop

Link: http://lkml.kernel.org/r/20170815194954.ck32ta2z35yuzpwp@debix
Signed-off-by: Guillaume Knispel <guillaume.knispel@supersonicimagine.com>
Reviewed-by: Marc Pardo <marc.pardo@supersonicimagine.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Guillaume Knispel <guillaume.knispel@supersonicimagine.com>
Cc: Marc Pardo <marc.pardo@supersonicimagine.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 0cfb6aee Fri Sep 08 17:17:55 MDT 2017 Guillaume Knispel <guillaume.knispel@supersonicimagine.com> ipc: optimize semget/shmget/msgget for lots of keys

ipc_findkey() used to scan all objects to look for the wanted key. This
is slow when using a high number of keys. This change adds an rhashtable
of kern_ipc_perm objects in ipc_ids, so that one lookup cease to be O(n).

This change gives a 865% improvement of benchmark reaim.jobs_per_min on a
56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory [1]

Other (more micro) benchmark results, by the author: On an i5 laptop, the
following loop executed right after a reboot took, without and with this
change:

for (int i = 0, k=0x424242; i < KEYS; ++i)
semget(k++, 1, IPC_CREAT | 0600);

total total max single max single
KEYS without with call without call with

1 3.5 4.9 µs 3.5 4.9
10 7.6 8.6 µs 3.7 4.7
32 16.2 15.9 µs 4.3 5.3
100 72.9 41.8 µs 3.7 4.7
1000 5,630.0 502.0 µs * *
10000 1,340,000.0 7,240.0 µs * *
31900 17,600,000.0 22,200.0 µs * *

*: unreliable measure: high variance

The duration for a lookup-only usage was obtained by the same loop once
the keys are present:

total total max single max single
KEYS without with call without call with

1 2.1 2.5 µs 2.1 2.5
10 4.5 4.8 µs 2.2 2.3
32 13.0 10.8 µs 2.3 2.8
100 82.9 25.1 µs * 2.3
1000 5,780.0 217.0 µs * *
10000 1,470,000.0 2,520.0 µs * *
31900 17,400,000.0 7,810.0 µs * *

Finally, executing each semget() in a new process gave, when still
summing only the durations of these syscalls:

creation:
total total
KEYS without with

1 3.7 5.0 µs
10 32.9 36.7 µs
32 125.0 109.0 µs
100 523.0 353.0 µs
1000 20,300.0 3,280.0 µs
10000 2,470,000.0 46,700.0 µs
31900 27,800,000.0 219,000.0 µs

lookup-only:
total total
KEYS without with

1 2.5 2.7 µs
10 25.4 24.4 µs
32 106.0 72.6 µs
100 591.0 352.0 µs
1000 22,400.0 2,250.0 µs
10000 2,510,000.0 25,700.0 µs
31900 28,200,000.0 115,000.0 µs

[1] http://lkml.kernel.org/r/20170814060507.GE23258@yexl-desktop

Link: http://lkml.kernel.org/r/20170815194954.ck32ta2z35yuzpwp@debix
Signed-off-by: Guillaume Knispel <guillaume.knispel@supersonicimagine.com>
Reviewed-by: Marc Pardo <marc.pardo@supersonicimagine.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Guillaume Knispel <guillaume.knispel@supersonicimagine.com>
Cc: Marc Pardo <marc.pardo@supersonicimagine.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 5b825c3a Thu Feb 02 09:54:15 MST 2017 Ingo Molnar <mingo@kernel.org> sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h>

Add #include <linux/cred.h> dependencies to all .c files rely on sched.h
doing that for them.

Note that even if the count where we need to add extra headers seems high,
it's still a net win, because <linux/sched.h> is included in over
2,200 files ...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
diff 5e4a0847 Fri Dec 14 08:55:36 MST 2012 Eric W. Biederman <ebiederm@xmission.com> userns: Require CAP_SYS_ADMIN for most uses of setns.

Andy Lutomirski <luto@amacapital.net> found a nasty little bug in
the permissions of setns. With unprivileged user namespaces it
became possible to create new namespaces without privilege.

However the setns calls were relaxed to only require CAP_SYS_ADMIN in
the user nameapce of the targed namespace.

Which made the following nasty sequence possible.

pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
if (pid == 0) { /* child */
system("mount --bind /home/me/passwd /etc/passwd");
}
else if (pid != 0) { /* parent */
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
fd = open(path, O_RDONLY);
setns(fd, 0);
system("su -");
}

Prevent this possibility by requiring CAP_SYS_ADMIN
in the current user namespace when joing all but the user namespace.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
diff 5e4a0847 Fri Dec 14 08:55:36 MST 2012 Eric W. Biederman <ebiederm@xmission.com> userns: Require CAP_SYS_ADMIN for most uses of setns.

Andy Lutomirski <luto@amacapital.net> found a nasty little bug in
the permissions of setns. With unprivileged user namespaces it
became possible to create new namespaces without privilege.

However the setns calls were relaxed to only require CAP_SYS_ADMIN in
the user nameapce of the targed namespace.

Which made the following nasty sequence possible.

pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
if (pid == 0) { /* child */
system("mount --bind /home/me/passwd /etc/passwd");
}
else if (pid != 0) { /* parent */
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
fd = open(path, O_RDONLY);
setns(fd, 0);
system("su -");
}

Prevent this possibility by requiring CAP_SYS_ADMIN
in the current user namespace when joing all but the user namespace.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
/linux-master/net/core/
H A Dnet_namespace.cdiff ea6932d7 Fri Jun 11 08:29:59 MDT 2021 Changbin Du <changbin.du@intel.com> net: make get_net_ns return error if NET_NS is disabled

There is a panic in socket ioctl cmd SIOCGSKNS when NET_NS is not enabled.
The reason is that nsfs tries to access ns->ops but the proc_ns_operations
is not implemented in this case.

[7.670023] Unable to handle kernel NULL pointer dereference at virtual address 00000010
[7.670268] pgd = 32b54000
[7.670544] [00000010] *pgd=00000000
[7.671861] Internal error: Oops: 5 [#1] SMP ARM
[7.672315] Modules linked in:
[7.672918] CPU: 0 PID: 1 Comm: systemd Not tainted 5.13.0-rc3-00375-g6799d4f2da49 #16
[7.673309] Hardware name: Generic DT based system
[7.673642] PC is at nsfs_evict+0x24/0x30
[7.674486] LR is at clear_inode+0x20/0x9c

The same to tun SIOCGSKNS command.

To fix this problem, we make get_net_ns() return -EINVAL when NET_NS is
disabled. Meanwhile move it to right place net/core/net_namespace.c.

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Fixes: c62cce2caee5 ("net: add an ioctl to get a socket network namespace")
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 5ba049a5 Mon Feb 12 14:26:13 MST 2018 Kirill Tkhai <ktkhai@virtuozzo.com> net: Cleanup in copy_net_ns()

Line up destructors actions in the revers order
to constructors. Next patches will add more actions,
and this will be comfortable, if there is the such
order.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 7866cc57 Tue May 30 03:38:12 MDT 2017 Florian Westphal <fw@strlen.de> netns: add and use net_ns_barrier

Quoting Joe Stringer:
If a user loads nf_conntrack_ftp, sends FTP traffic through a network
namespace, destroys that namespace then unloads the FTP helper module,
then the kernel will crash.

Events that lead to the crash:
1. conntrack is created with ftp helper in netns x
2. This netns is destroyed
3. netns destruction is scheduled
4. netns destruction wq starts, removes netns from global list
5. ftp helper is unloaded, which resets all helpers of the conntracks
via for_each_net()

but because netns is already gone from list the for_each_net() loop
doesn't include it, therefore all of these conntracks are unaffected.

6. helper module unload finishes
7. netns wq invokes destructor for rmmod'ed helper

CC: "Eric W. Biederman" <ebiederm@xmission.com>
Reported-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
diff 5a950ad5 Thu Apr 16 07:17:35 MDT 2015 Wei Yongjun <yongjun_wei@trendmicro.com.cn> netns: remove duplicated include from net_namespace.c

Remove duplicated include.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 5e4a0847 Fri Dec 14 08:55:36 MST 2012 Eric W. Biederman <ebiederm@xmission.com> userns: Require CAP_SYS_ADMIN for most uses of setns.

Andy Lutomirski <luto@amacapital.net> found a nasty little bug in
the permissions of setns. With unprivileged user namespaces it
became possible to create new namespaces without privilege.

However the setns calls were relaxed to only require CAP_SYS_ADMIN in
the user nameapce of the targed namespace.

Which made the following nasty sequence possible.

pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
if (pid == 0) { /* child */
system("mount --bind /home/me/passwd /etc/passwd");
}
else if (pid != 0) { /* parent */
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
fd = open(path, O_RDONLY);
setns(fd, 0);
system("su -");
}

Prevent this possibility by requiring CAP_SYS_ADMIN
in the current user namespace when joing all but the user namespace.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
diff 5e4a0847 Fri Dec 14 08:55:36 MST 2012 Eric W. Biederman <ebiederm@xmission.com> userns: Require CAP_SYS_ADMIN for most uses of setns.

Andy Lutomirski <luto@amacapital.net> found a nasty little bug in
the permissions of setns. With unprivileged user namespaces it
became possible to create new namespaces without privilege.

However the setns calls were relaxed to only require CAP_SYS_ADMIN in
the user nameapce of the targed namespace.

Which made the following nasty sequence possible.

pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
if (pid == 0) { /* child */
system("mount --bind /home/me/passwd /etc/passwd");
}
else if (pid != 0) { /* parent */
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
fd = open(path, O_RDONLY);
setns(fd, 0);
system("su -");
}

Prevent this possibility by requiring CAP_SYS_ADMIN
in the current user namespace when joing all but the user namespace.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
diff b9f75f45 Fri Jun 20 23:16:51 MDT 2008 Eric W. Biederman <ebiederm@xmission.com> netns: Don't receive new packets in a dead network namespace.

Alexey Dobriyan <adobriyan@gmail.com> writes:
> Subject: ICMP sockets destruction vs ICMP packets oops

> After icmp_sk_exit() nuked ICMP sockets, we get an interrupt.
> icmp_reply() wants ICMP socket.
>
> Steps to reproduce:
>
> launch shell in new netns
> move real NIC to netns
> setup routing
> ping -i 0
> exit from shell
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<ffffffff803fce17>] icmp_sk+0x17/0x30
> PGD 17f3cd067 PUD 17f3ce067 PMD 0
> Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU 0
> Modules linked in: usblp usbcore
> Pid: 0, comm: swapper Not tainted 2.6.26-rc6-netns-ct #4
> RIP: 0010:[<ffffffff803fce17>] [<ffffffff803fce17>] icmp_sk+0x17/0x30
> RSP: 0018:ffffffff8057fc30 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff81017c7db900
> RDX: 0000000000000034 RSI: ffff81017c7db900 RDI: ffff81017dc41800
> RBP: ffffffff8057fc40 R08: 0000000000000001 R09: 000000000000a815
> R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff8057fd28
> R13: ffffffff8057fd00 R14: ffff81017c7db938 R15: ffff81017dc41800
> FS: 0000000000000000(0000) GS:ffffffff80525000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 000000017fcda000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffffffff8053a000, task ffffffff804fa4a0)
> Stack: 0000000000000000 ffff81017c7db900 ffffffff8057fcf0 ffffffff803fcfe4
> ffffffff804faa38 0000000000000246 0000000000005a40 0000000000000246
> 000000000001ffff ffff81017dd68dc0 0000000000005a40 0000000055342436
> Call Trace:
> <IRQ> [<ffffffff803fcfe4>] icmp_reply+0x44/0x1e0
> [<ffffffff803d3a0a>] ? ip_route_input+0x23a/0x1360
> [<ffffffff803fd645>] icmp_echo+0x65/0x70
> [<ffffffff803fd300>] icmp_rcv+0x180/0x1b0
> [<ffffffff803d6d84>] ip_local_deliver+0xf4/0x1f0
> [<ffffffff803d71bb>] ip_rcv+0x33b/0x650
> [<ffffffff803bb16a>] netif_receive_skb+0x27a/0x340
> [<ffffffff803be57d>] process_backlog+0x9d/0x100
> [<ffffffff803bdd4d>] net_rx_action+0x18d/0x250
> [<ffffffff80237be5>] __do_softirq+0x75/0x100
> [<ffffffff8020c97c>] call_softirq+0x1c/0x30
> [<ffffffff8020f085>] do_softirq+0x65/0xa0
> [<ffffffff80237af7>] irq_exit+0x97/0xa0
> [<ffffffff8020f198>] do_IRQ+0xa8/0x130
> [<ffffffff80212ee0>] ? mwait_idle+0x0/0x60
> [<ffffffff8020bc46>] ret_from_intr+0x0/0xf
> <EOI> [<ffffffff80212f2c>] ? mwait_idle+0x4c/0x60
> [<ffffffff80212f23>] ? mwait_idle+0x43/0x60
> [<ffffffff8020a217>] ? cpu_idle+0x57/0xa0
> [<ffffffff8040f380>] ? rest_init+0x70/0x80
> Code: 10 5b 41 5c 41 5d 41 5e c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53
> 48 83 ec 08 48 8b 9f 78 01 00 00 e8 2b c7 f1 ff 89 c0 <48> 8b 04 c3 48 83 c4 08
> 5b c9 c3 66 66 66 66 66 2e 0f 1f 84 00
> RIP [<ffffffff803fce17>] icmp_sk+0x17/0x30
> RSP <ffffffff8057fc30>
> CR2: 0000000000000000
> ---[ end trace ea161157b76b33e8 ]---
> Kernel panic - not syncing: Aiee, killing interrupt handler!

Receiving packets while we are cleaning up a network namespace is a
racy proposition. It is possible when the packet arrives that we have
removed some but not all of the state we need to fully process it. We
have the choice of either playing wack-a-mole with the cleanup routines
or simply dropping packets when we don't have a network namespace to
handle them.

Since the check looks inexpensive in netif_receive_skb let's just
drop the incoming packets.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff b9f75f45 Fri Jun 20 23:16:51 MDT 2008 Eric W. Biederman <ebiederm@xmission.com> netns: Don't receive new packets in a dead network namespace.

Alexey Dobriyan <adobriyan@gmail.com> writes:
> Subject: ICMP sockets destruction vs ICMP packets oops

> After icmp_sk_exit() nuked ICMP sockets, we get an interrupt.
> icmp_reply() wants ICMP socket.
>
> Steps to reproduce:
>
> launch shell in new netns
> move real NIC to netns
> setup routing
> ping -i 0
> exit from shell
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<ffffffff803fce17>] icmp_sk+0x17/0x30
> PGD 17f3cd067 PUD 17f3ce067 PMD 0
> Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU 0
> Modules linked in: usblp usbcore
> Pid: 0, comm: swapper Not tainted 2.6.26-rc6-netns-ct #4
> RIP: 0010:[<ffffffff803fce17>] [<ffffffff803fce17>] icmp_sk+0x17/0x30
> RSP: 0018:ffffffff8057fc30 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff81017c7db900
> RDX: 0000000000000034 RSI: ffff81017c7db900 RDI: ffff81017dc41800
> RBP: ffffffff8057fc40 R08: 0000000000000001 R09: 000000000000a815
> R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff8057fd28
> R13: ffffffff8057fd00 R14: ffff81017c7db938 R15: ffff81017dc41800
> FS: 0000000000000000(0000) GS:ffffffff80525000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 000000017fcda000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffffffff8053a000, task ffffffff804fa4a0)
> Stack: 0000000000000000 ffff81017c7db900 ffffffff8057fcf0 ffffffff803fcfe4
> ffffffff804faa38 0000000000000246 0000000000005a40 0000000000000246
> 000000000001ffff ffff81017dd68dc0 0000000000005a40 0000000055342436
> Call Trace:
> <IRQ> [<ffffffff803fcfe4>] icmp_reply+0x44/0x1e0
> [<ffffffff803d3a0a>] ? ip_route_input+0x23a/0x1360
> [<ffffffff803fd645>] icmp_echo+0x65/0x70
> [<ffffffff803fd300>] icmp_rcv+0x180/0x1b0
> [<ffffffff803d6d84>] ip_local_deliver+0xf4/0x1f0
> [<ffffffff803d71bb>] ip_rcv+0x33b/0x650
> [<ffffffff803bb16a>] netif_receive_skb+0x27a/0x340
> [<ffffffff803be57d>] process_backlog+0x9d/0x100
> [<ffffffff803bdd4d>] net_rx_action+0x18d/0x250
> [<ffffffff80237be5>] __do_softirq+0x75/0x100
> [<ffffffff8020c97c>] call_softirq+0x1c/0x30
> [<ffffffff8020f085>] do_softirq+0x65/0xa0
> [<ffffffff80237af7>] irq_exit+0x97/0xa0
> [<ffffffff8020f198>] do_IRQ+0xa8/0x130
> [<ffffffff80212ee0>] ? mwait_idle+0x0/0x60
> [<ffffffff8020bc46>] ret_from_intr+0x0/0xf
> <EOI> [<ffffffff80212f2c>] ? mwait_idle+0x4c/0x60
> [<ffffffff80212f23>] ? mwait_idle+0x43/0x60
> [<ffffffff8020a217>] ? cpu_idle+0x57/0xa0
> [<ffffffff8040f380>] ? rest_init+0x70/0x80
> Code: 10 5b 41 5c 41 5d 41 5e c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53
> 48 83 ec 08 48 8b 9f 78 01 00 00 e8 2b c7 f1 ff 89 c0 <48> 8b 04 c3 48 83 c4 08
> 5b c9 c3 66 66 66 66 66 2e 0f 1f 84 00
> RIP [<ffffffff803fce17>] icmp_sk+0x17/0x30
> RSP <ffffffff8057fc30>
> CR2: 0000000000000000
> ---[ end trace ea161157b76b33e8 ]---
> Kernel panic - not syncing: Aiee, killing interrupt handler!

Receiving packets while we are cleaning up a network namespace is a
racy proposition. It is possible when the packet arrives that we have
removed some but not all of the state we need to fully process it. We
have the choice of either playing wack-a-mole with the cleanup routines
or simply dropping packets when we don't have a network namespace to
handle them.

Since the check looks inexpensive in netif_receive_skb let's just
drop the incoming packets.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff b9f75f45 Fri Jun 20 23:16:51 MDT 2008 Eric W. Biederman <ebiederm@xmission.com> netns: Don't receive new packets in a dead network namespace.

Alexey Dobriyan <adobriyan@gmail.com> writes:
> Subject: ICMP sockets destruction vs ICMP packets oops

> After icmp_sk_exit() nuked ICMP sockets, we get an interrupt.
> icmp_reply() wants ICMP socket.
>
> Steps to reproduce:
>
> launch shell in new netns
> move real NIC to netns
> setup routing
> ping -i 0
> exit from shell
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<ffffffff803fce17>] icmp_sk+0x17/0x30
> PGD 17f3cd067 PUD 17f3ce067 PMD 0
> Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU 0
> Modules linked in: usblp usbcore
> Pid: 0, comm: swapper Not tainted 2.6.26-rc6-netns-ct #4
> RIP: 0010:[<ffffffff803fce17>] [<ffffffff803fce17>] icmp_sk+0x17/0x30
> RSP: 0018:ffffffff8057fc30 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff81017c7db900
> RDX: 0000000000000034 RSI: ffff81017c7db900 RDI: ffff81017dc41800
> RBP: ffffffff8057fc40 R08: 0000000000000001 R09: 000000000000a815
> R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff8057fd28
> R13: ffffffff8057fd00 R14: ffff81017c7db938 R15: ffff81017dc41800
> FS: 0000000000000000(0000) GS:ffffffff80525000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 000000017fcda000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffffffff8053a000, task ffffffff804fa4a0)
> Stack: 0000000000000000 ffff81017c7db900 ffffffff8057fcf0 ffffffff803fcfe4
> ffffffff804faa38 0000000000000246 0000000000005a40 0000000000000246
> 000000000001ffff ffff81017dd68dc0 0000000000005a40 0000000055342436
> Call Trace:
> <IRQ> [<ffffffff803fcfe4>] icmp_reply+0x44/0x1e0
> [<ffffffff803d3a0a>] ? ip_route_input+0x23a/0x1360
> [<ffffffff803fd645>] icmp_echo+0x65/0x70
> [<ffffffff803fd300>] icmp_rcv+0x180/0x1b0
> [<ffffffff803d6d84>] ip_local_deliver+0xf4/0x1f0
> [<ffffffff803d71bb>] ip_rcv+0x33b/0x650
> [<ffffffff803bb16a>] netif_receive_skb+0x27a/0x340
> [<ffffffff803be57d>] process_backlog+0x9d/0x100
> [<ffffffff803bdd4d>] net_rx_action+0x18d/0x250
> [<ffffffff80237be5>] __do_softirq+0x75/0x100
> [<ffffffff8020c97c>] call_softirq+0x1c/0x30
> [<ffffffff8020f085>] do_softirq+0x65/0xa0
> [<ffffffff80237af7>] irq_exit+0x97/0xa0
> [<ffffffff8020f198>] do_IRQ+0xa8/0x130
> [<ffffffff80212ee0>] ? mwait_idle+0x0/0x60
> [<ffffffff8020bc46>] ret_from_intr+0x0/0xf
> <EOI> [<ffffffff80212f2c>] ? mwait_idle+0x4c/0x60
> [<ffffffff80212f23>] ? mwait_idle+0x43/0x60
> [<ffffffff8020a217>] ? cpu_idle+0x57/0xa0
> [<ffffffff8040f380>] ? rest_init+0x70/0x80
> Code: 10 5b 41 5c 41 5d 41 5e c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53
> 48 83 ec 08 48 8b 9f 78 01 00 00 e8 2b c7 f1 ff 89 c0 <48> 8b 04 c3 48 83 c4 08
> 5b c9 c3 66 66 66 66 66 2e 0f 1f 84 00
> RIP [<ffffffff803fce17>] icmp_sk+0x17/0x30
> RSP <ffffffff8057fc30>
> CR2: 0000000000000000
> ---[ end trace ea161157b76b33e8 ]---
> Kernel panic - not syncing: Aiee, killing interrupt handler!

Receiving packets while we are cleaning up a network namespace is a
racy proposition. It is possible when the packet arrives that we have
removed some but not all of the state we need to fully process it. We
have the choice of either playing wack-a-mole with the cleanup routines
or simply dropping packets when we don't have a network namespace to
handle them.

Since the check looks inexpensive in netif_receive_skb let's just
drop the incoming packets.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff b9f75f45 Fri Jun 20 23:16:51 MDT 2008 Eric W. Biederman <ebiederm@xmission.com> netns: Don't receive new packets in a dead network namespace.

Alexey Dobriyan <adobriyan@gmail.com> writes:
> Subject: ICMP sockets destruction vs ICMP packets oops

> After icmp_sk_exit() nuked ICMP sockets, we get an interrupt.
> icmp_reply() wants ICMP socket.
>
> Steps to reproduce:
>
> launch shell in new netns
> move real NIC to netns
> setup routing
> ping -i 0
> exit from shell
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<ffffffff803fce17>] icmp_sk+0x17/0x30
> PGD 17f3cd067 PUD 17f3ce067 PMD 0
> Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU 0
> Modules linked in: usblp usbcore
> Pid: 0, comm: swapper Not tainted 2.6.26-rc6-netns-ct #4
> RIP: 0010:[<ffffffff803fce17>] [<ffffffff803fce17>] icmp_sk+0x17/0x30
> RSP: 0018:ffffffff8057fc30 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff81017c7db900
> RDX: 0000000000000034 RSI: ffff81017c7db900 RDI: ffff81017dc41800
> RBP: ffffffff8057fc40 R08: 0000000000000001 R09: 000000000000a815
> R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff8057fd28
> R13: ffffffff8057fd00 R14: ffff81017c7db938 R15: ffff81017dc41800
> FS: 0000000000000000(0000) GS:ffffffff80525000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 000000017fcda000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffffffff8053a000, task ffffffff804fa4a0)
> Stack: 0000000000000000 ffff81017c7db900 ffffffff8057fcf0 ffffffff803fcfe4
> ffffffff804faa38 0000000000000246 0000000000005a40 0000000000000246
> 000000000001ffff ffff81017dd68dc0 0000000000005a40 0000000055342436
> Call Trace:
> <IRQ> [<ffffffff803fcfe4>] icmp_reply+0x44/0x1e0
> [<ffffffff803d3a0a>] ? ip_route_input+0x23a/0x1360
> [<ffffffff803fd645>] icmp_echo+0x65/0x70
> [<ffffffff803fd300>] icmp_rcv+0x180/0x1b0
> [<ffffffff803d6d84>] ip_local_deliver+0xf4/0x1f0
> [<ffffffff803d71bb>] ip_rcv+0x33b/0x650
> [<ffffffff803bb16a>] netif_receive_skb+0x27a/0x340
> [<ffffffff803be57d>] process_backlog+0x9d/0x100
> [<ffffffff803bdd4d>] net_rx_action+0x18d/0x250
> [<ffffffff80237be5>] __do_softirq+0x75/0x100
> [<ffffffff8020c97c>] call_softirq+0x1c/0x30
> [<ffffffff8020f085>] do_softirq+0x65/0xa0
> [<ffffffff80237af7>] irq_exit+0x97/0xa0
> [<ffffffff8020f198>] do_IRQ+0xa8/0x130
> [<ffffffff80212ee0>] ? mwait_idle+0x0/0x60
> [<ffffffff8020bc46>] ret_from_intr+0x0/0xf
> <EOI> [<ffffffff80212f2c>] ? mwait_idle+0x4c/0x60
> [<ffffffff80212f23>] ? mwait_idle+0x43/0x60
> [<ffffffff8020a217>] ? cpu_idle+0x57/0xa0
> [<ffffffff8040f380>] ? rest_init+0x70/0x80
> Code: 10 5b 41 5c 41 5d 41 5e c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53
> 48 83 ec 08 48 8b 9f 78 01 00 00 e8 2b c7 f1 ff 89 c0 <48> 8b 04 c3 48 83 c4 08
> 5b c9 c3 66 66 66 66 66 2e 0f 1f 84 00
> RIP [<ffffffff803fce17>] icmp_sk+0x17/0x30
> RSP <ffffffff8057fc30>
> CR2: 0000000000000000
> ---[ end trace ea161157b76b33e8 ]---
> Kernel panic - not syncing: Aiee, killing interrupt handler!

Receiving packets while we are cleaning up a network namespace is a
racy proposition. It is possible when the packet arrives that we have
removed some but not all of the state we need to fully process it. We
have the choice of either playing wack-a-mole with the cleanup routines
or simply dropping packets when we don't have a network namespace to
handle them.

Since the check looks inexpensive in netif_receive_skb let's just
drop the incoming packets.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff b9f75f45 Fri Jun 20 23:16:51 MDT 2008 Eric W. Biederman <ebiederm@xmission.com> netns: Don't receive new packets in a dead network namespace.

Alexey Dobriyan <adobriyan@gmail.com> writes:
> Subject: ICMP sockets destruction vs ICMP packets oops

> After icmp_sk_exit() nuked ICMP sockets, we get an interrupt.
> icmp_reply() wants ICMP socket.
>
> Steps to reproduce:
>
> launch shell in new netns
> move real NIC to netns
> setup routing
> ping -i 0
> exit from shell
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<ffffffff803fce17>] icmp_sk+0x17/0x30
> PGD 17f3cd067 PUD 17f3ce067 PMD 0
> Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU 0
> Modules linked in: usblp usbcore
> Pid: 0, comm: swapper Not tainted 2.6.26-rc6-netns-ct #4
> RIP: 0010:[<ffffffff803fce17>] [<ffffffff803fce17>] icmp_sk+0x17/0x30
> RSP: 0018:ffffffff8057fc30 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff81017c7db900
> RDX: 0000000000000034 RSI: ffff81017c7db900 RDI: ffff81017dc41800
> RBP: ffffffff8057fc40 R08: 0000000000000001 R09: 000000000000a815
> R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff8057fd28
> R13: ffffffff8057fd00 R14: ffff81017c7db938 R15: ffff81017dc41800
> FS: 0000000000000000(0000) GS:ffffffff80525000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 000000017fcda000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffffffff8053a000, task ffffffff804fa4a0)
> Stack: 0000000000000000 ffff81017c7db900 ffffffff8057fcf0 ffffffff803fcfe4
> ffffffff804faa38 0000000000000246 0000000000005a40 0000000000000246
> 000000000001ffff ffff81017dd68dc0 0000000000005a40 0000000055342436
> Call Trace:
> <IRQ> [<ffffffff803fcfe4>] icmp_reply+0x44/0x1e0
> [<ffffffff803d3a0a>] ? ip_route_input+0x23a/0x1360
> [<ffffffff803fd645>] icmp_echo+0x65/0x70
> [<ffffffff803fd300>] icmp_rcv+0x180/0x1b0
> [<ffffffff803d6d84>] ip_local_deliver+0xf4/0x1f0
> [<ffffffff803d71bb>] ip_rcv+0x33b/0x650
> [<ffffffff803bb16a>] netif_receive_skb+0x27a/0x340
> [<ffffffff803be57d>] process_backlog+0x9d/0x100
> [<ffffffff803bdd4d>] net_rx_action+0x18d/0x250
> [<ffffffff80237be5>] __do_softirq+0x75/0x100
> [<ffffffff8020c97c>] call_softirq+0x1c/0x30
> [<ffffffff8020f085>] do_softirq+0x65/0xa0
> [<ffffffff80237af7>] irq_exit+0x97/0xa0
> [<ffffffff8020f198>] do_IRQ+0xa8/0x130
> [<ffffffff80212ee0>] ? mwait_idle+0x0/0x60
> [<ffffffff8020bc46>] ret_from_intr+0x0/0xf
> <EOI> [<ffffffff80212f2c>] ? mwait_idle+0x4c/0x60
> [<ffffffff80212f23>] ? mwait_idle+0x43/0x60
> [<ffffffff8020a217>] ? cpu_idle+0x57/0xa0
> [<ffffffff8040f380>] ? rest_init+0x70/0x80
> Code: 10 5b 41 5c 41 5d 41 5e c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53
> 48 83 ec 08 48 8b 9f 78 01 00 00 e8 2b c7 f1 ff 89 c0 <48> 8b 04 c3 48 83 c4 08
> 5b c9 c3 66 66 66 66 66 2e 0f 1f 84 00
> RIP [<ffffffff803fce17>] icmp_sk+0x17/0x30
> RSP <ffffffff8057fc30>
> CR2: 0000000000000000
> ---[ end trace ea161157b76b33e8 ]---
> Kernel panic - not syncing: Aiee, killing interrupt handler!

Receiving packets while we are cleaning up a network namespace is a
racy proposition. It is possible when the packet arrives that we have
removed some but not all of the state we need to fully process it. We
have the choice of either playing wack-a-mole with the cleanup routines
or simply dropping packets when we don't have a network namespace to
handle them.

Since the check looks inexpensive in netif_receive_skb let's just
drop the incoming packets.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
/linux-master/fs/
H A Dnamespace.cdiff b4c2bea8 Wed Oct 25 08:02:03 MDT 2023 Miklos Szeredi <mszeredi@redhat.com> add listmount(2) syscall

Add way to query the children of a particular mount. This is a more
flexible way to iterate the mount tree than having to parse
/proc/self/mountinfo.

Lookup the mount by the new 64bit mount ID. If a mount needs to be
queried based on path, then statx(2) can be used to first query the
mount ID belonging to the path.

Return an array of new (64bit) mount ID's. Without privileges only
mounts are listed which are reachable from the task's root.

Folded into this patch are several later improvements. Keeping them
separate would make the history pointlessly confusing:

* Recursive listing of mounts is the default now (cf. [1]).
* Remove explicit LISTMOUNT_UNREACHABLE flag (cf. [1]) and fail if mount
is unreachable from current root. This also makes permission checking
consistent with statmount() (cf. [3]).
* Start listing mounts in unique mount ID order (cf. [2]) to allow
continuing listmount() from a midpoint.
* Allow to continue listmount(). The @request_mask parameter is renamed
and to @param to be usable by both statmount() and listmount().
If @param is set to a mount id then listmount() will continue listing
mounts from that id on. This allows listing mounts in multiple
listmount invocations without having to resize the buffer. If @param
is zero then the listing starts from the beginning (cf. [4]).
* Don't return EOVERFLOW, instead return the buffer size which allows to
detect a full buffer as well (cf. [4]).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20231025140205.3586473-6-mszeredi@redhat.com
Reviewed-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/20231128160337.29094-2-mszeredi@redhat.com [1] (folded)
Link: https://lore.kernel.org/r/20231128160337.29094-3-mszeredi@redhat.com [2] (folded)
Link: https://lore.kernel.org/r/20231128160337.29094-4-mszeredi@redhat.com [3] (folded)
Link: https://lore.kernel.org/r/20231128160337.29094-5-mszeredi@redhat.com [4] (folded)
[Christian Brauner <brauner@kernel.org>: various smaller fixes]
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 46eae99e Wed Oct 25 08:02:02 MDT 2023 Miklos Szeredi <mszeredi@redhat.com> add statmount(2) syscall

Add a way to query attributes of a single mount instead of having to parse
the complete /proc/$PID/mountinfo, which might be huge.

Lookup the mount the new 64bit mount ID. If a mount needs to be queried
based on path, then statx(2) can be used to first query the mount ID
belonging to the path.

Design is based on a suggestion by Linus:

"So I'd suggest something that is very much like "statfsat()", which gets
a buffer and a length, and returns an extended "struct statfs" *AND*
just a string description at the end."

The interface closely mimics that of statx.

Handle ASCII attributes by appending after the end of the structure (as per
above suggestion). Pointers to strings are stored in u64 members to make
the structure the same regardless of pointer size. Strings are nul
terminated.

Link: https://lore.kernel.org/all/CAHk-=wh5YifP7hzKSbwJj94+DZ2czjrZsczy6GBimiogZws=rg@mail.gmail.com/
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20231025140205.3586473-5-mszeredi@redhat.com
Reviewed-by: Ian Kent <raven@themaw.net>
[Christian Brauner <brauner@kernel.org>: various minor changes]
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 6ac39281 Wed May 03 05:18:42 MDT 2023 Christian Brauner <brauner@kernel.org> fs: allow to mount beneath top mount

Various distributions are adding or are in the process of adding support
for system extensions and in the future configuration extensions through
various tools. A more detailed explanation on system and configuration
extensions can be found on the manpage which is listed below at [1].

System extension images may – dynamically at runtime — extend the /usr/
and /opt/ directory hierarchies with additional files. This is
particularly useful on immutable system images where a /usr/ and/or
/opt/ hierarchy residing on a read-only file system shall be extended
temporarily at runtime without making any persistent modifications.

When one or more system extension images are activated, their /usr/ and
/opt/ hierarchies are combined via overlayfs with the same hierarchies
of the host OS, and the host /usr/ and /opt/ overmounted with it
("merging"). When they are deactivated, the mount point is disassembled
— again revealing the unmodified original host version of the hierarchy
("unmerging"). Merging thus makes the extension's resources suddenly
appear below the /usr/ and /opt/ hierarchies as if they were included in
the base OS image itself. Unmerging makes them disappear again, leaving
in place only the files that were shipped with the base OS image itself.

System configuration images are similar but operate on directories
containing system or service configuration.

On nearly all modern distributions mount propagation plays a crucial
role and the rootfs of the OS is a shared mount in a peer group (usually
with peer group id 1):

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/ / ext4 shared:1 29 1

On such systems all services and containers run in a separate mount
namespace and are pivot_root()ed into their rootfs. A separate mount
namespace is almost always used as it is the minimal isolation mechanism
services have. But usually they are even much more isolated up to the
point where they almost become indistinguishable from containers.

Mount propagation again plays a crucial role here. The rootfs of all
these services is a slave mount to the peer group of the host rootfs.
This is done so the service will receive mount propagation events from
the host when certain files or directories are updated.

In addition, the rootfs of each service, container, and sandbox is also
a shared mount in its separate peer group:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/ / ext4 shared:24 master:1 71 47

For people not too familiar with mount propagation, the master:1 means
that this is a slave mount to peer group 1. Which as one can see is the
host rootfs as indicated by shared:1 above. The shared:24 indicates that
the service rootfs is a shared mount in a separate peer group with peer
group id 24.

A service may run other services. Such nested services will also have a
rootfs mount that is a slave to the peer group of the outer service
rootfs mount.

For containers things are just slighly different. A container's rootfs
isn't a slave to the service's or host rootfs' peer group. The rootfs
mount of a container is simply a shared mount in its own peer group:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/home/ubuntu/debian-tree / ext4 shared:99 61 60

So whereas services are isolated OS components a container is treated
like a separate world and mount propagation into it is restricted to a
single well known mount that is a slave to the peer group of the shared
mount /run on the host:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/propagate/debian-tree /run/host/incoming tmpfs master:5 71 68

Here, the master:5 indicates that this mount is a slave to the peer
group with peer group id 5. This allows to propagate mounts into the
container and served as a workaround for not being able to insert mounts
into mount namespaces directly. But the new mount api does support
inserting mounts directly. For the interested reader the blogpost in [2]
might be worth reading where I explain the old and the new approach to
inserting mounts into mount namespaces.

Containers of course, can themselves be run as services. They often run
full systems themselves which means they again run services and
containers with the exact same propagation settings explained above.

The whole system is designed so that it can be easily updated, including
all services in various fine-grained ways without having to enter every
single service's mount namespace which would be prohibitively expensive.
The mount propagation layout has been carefully chosen so it is possible
to propagate updates for system extensions and configurations from the
host into all services.

The simplest model to update the whole system is to mount on top of
/usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
will then propagate into every service. This works cleanly the first
time. However, when the system is updated multiple times it becomes
necessary to unmount the first update on /opt, /usr, /etc and then
propagate the new update. But this means, there's an interval where the
old base system is accessible. This has to be avoided to protect against
downgrade attacks.

The vfs already exposes a mechanism to userspace whereby mounts can be
mounted beneath an existing mount. Such mounts are internally referred
to as "tucked". The patch series exposes the ability to mount beneath a
top mount through the new MOVE_MOUNT_BENEATH flag for the move_mount()
system call. This allows userspace to seamlessly upgrade mounts. After
this series the only thing that will have changed is that mounting
beneath an existing mount can be done explicitly instead of just
implicitly.

Today, there are two scenarios where a mount can be mounted beneath an
existing mount instead of on top of it:

(1) When a service or container is started in a new mount namespace and
pivot_root()s into its new rootfs. The way this is done is by
mounting the new rootfs beneath the old rootfs:

fd_newroot = open("/var/lib/machines/fedora", ...);
fd_oldroot = open("/", ...);
fchdir(fd_newroot);
pivot_root(".", ".");

After the pivot_root(".", ".") call the new rootfs is mounted
beneath the old rootfs which can then be unmounted to reveal the
underlying mount:

fchdir(fd_oldroot);
umount2(".", MNT_DETACH);

Since pivot_root() moves the caller into a new rootfs no mounts must
be propagated out of the new rootfs as a consequence of the
pivot_root() call. Thus, the mounts cannot be shared.

(2) When a mount is propagated to a mount that already has another mount
mounted on the same dentry.

The easiest example for this is to create a new mount namespace. The
following commands will create a mount namespace where the rootfs
mount / will be a slave to the peer group of the host rootfs /
mount's peer group. IOW, it will receive propagation from the host:

mount --make-shared /
unshare --mount --propagation=slave

Now a new mount on the /mnt dentry in that mount namespace is
created. (As it can be confusing it should be spelled out that the
tmpfs mount on the /mnt dentry that was just created doesn't
propagate back to the host because the rootfs mount / of the mount
namespace isn't a peer of the host rootfs.):

mount -t tmpfs tmpfs /mnt

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt tmpfs tmpfs

Now another terminal in the host mount namespace can observe that
the mount indeed hasn't propagated back to into the host mount
namespace. A new mount can now be created on top of the /mnt dentry
with the rootfs mount / as its parent:

mount --bind /opt /mnt

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt /dev/sda2[/opt] ext4 shared:1

The mount namespace that was created earlier can now observe that
the bind mount created on the host has propagated into it:

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt /dev/sda2[/opt] ext4 master:1
└─/mnt tmpfs tmpfs

But instead of having been mounted on top of the tmpfs mount at the
/mnt dentry the /opt mount has been mounted on top of the rootfs
mount at the /mnt dentry. And the tmpfs mount has been remounted on
top of the propagated /opt mount at the /opt dentry. So in other
words, the propagated mount has been mounted beneath the preexisting
mount in that mount namespace.

Mount namespaces make this easy to illustrate but it's also easy to
mount beneath an existing mount in the same mount namespace
(The following example assumes a shared rootfs mount / with peer
group id 1):

mount --bind /opt /opt

TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION
└─/opt /dev/sda2[/opt] ext4 188 29 shared:1

If another mount is mounted on top of the /opt mount at the /opt
dentry:

mount --bind /tmp /opt

The following clunky mount tree will result:

TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION
└─/opt /dev/sda2[/tmp] ext4 405 29 shared:1
└─/opt /dev/sda2[/opt] ext4 188 405 shared:1
└─/opt /dev/sda2[/tmp] ext4 404 188 shared:1

The /tmp mount is mounted beneath the /opt mount and another copy is
mounted on top of the /opt mount. This happens because the rootfs /
and the /opt mount are shared mounts in the same peer group.

When the new /tmp mount is supposed to be mounted at the /opt dentry
then the /tmp mount first propagates to the root mount at the /opt
dentry. But there already is the /opt mount mounted at the /opt
dentry. So the old /opt mount at the /opt dentry will be mounted on
top of the new /tmp mount at the /tmp dentry, i.e. @opt->mnt_parent
is @tmp and @opt->mnt_mountpoint is /tmp (Note that @opt->mnt_root
is /opt which is what shows up as /opt under SOURCE). So again, a
mount will be mounted beneath a preexisting mount.

(Fwiw, a few iterations of mount --bind /opt /opt in a loop on a
shared rootfs is a good example of what could be referred to as
mount explosion.)

The main point is that such mounts allows userspace to umount a top
mount and reveal an underlying mount. So for example, umounting the
tmpfs mount on /mnt that was created in example (1) using mount
namespaces reveals the /opt mount which was mounted beneath it.

In (2) where a mount was mounted beneath the top mount in the same mount
namespace unmounting the top mount would unmount both the top mount and
the mount beneath. In the process the original mount would be remounted
on top of the rootfs mount / at the /opt dentry again.

This again, is a result of mount propagation only this time it's umount
propagation. However, this can be avoided by simply making the parent
mount / of the @opt mount a private or slave mount. Then the top mount
and the original mount can be unmounted to reveal the mount beneath.

These two examples are fairly arcane and are merely added to make it
clear how mount propagation has effects on current and future features.

More common use-cases will just be things like:

mount -t btrfs /dev/sdA /mnt
mount -t xfs /dev/sdB --beneath /mnt
umount /mnt

after which we'll have updated from a btrfs filesystem to a xfs
filesystem without ever revealing the underlying mountpoint.

The crux is that the proposed mechanism already exists and that it is so
powerful as to cover cases where mounts are supposed to be updated with
new versions. Crucially, it offers an important flexibility. Namely that
updates to a system may either be forced or can be delayed and the
umount of the top mount be left to a service if it is a cooperative one.

This adds a new flag to move_mount() that allows to explicitly move a
beneath the top mount adhering to the following semantics:

* Mounts cannot be mounted beneath the rootfs. This restriction
encompasses the rootfs but also chroots via chroot() and pivot_root().
To mount a mount beneath the rootfs or a chroot, pivot_root() can be
used as illustrated above.
* The source mount must be a private mount to force the kernel to
allocate a new, unused peer group id. This isn't a required
restriction but a voluntary one. It avoids repeating a semantical
quirk that already exists today. If bind mounts which already have a
peer group id are inserted into mount trees that have the same peer
group id this can cause a lot of mount propagation events to be
generated (For example, consider running mount --bind /opt /opt in a
loop where the parent mount is a shared mount.).
* Avoid getting rid of the top mount in the kernel. Cooperative services
need to be able to unmount the top mount themselves.
This also avoids a good deal of additional complexity. The umount
would have to be propagated which would be another rather expensive
operation. So namespace_lock() and lock_mount_hash() would potentially
have to be held for a long time for both a mount and umount
propagation. That should be avoided.
* The path to mount beneath must be mounted and attached.
* The top mount and its parent must be in the caller's mount namespace
and the caller must be able to mount in that mount namespace.
* The caller must be able to unmount the top mount to prove that they
could reveal the underlying mount.
* The propagation tree is calculated based on the destination mount's
parent mount and the destination mount's mountpoint on the parent
mount. Of course, if the parent of the destination mount and the
destination mount are shared mounts in the same peer group and the
mountpoint of the new mount to be mounted is a subdir of their
->mnt_root then both will receive a mount of /opt. That's probably
easier to understand with an example. Assuming a standard shared
rootfs /:

mount --bind /opt /opt
mount --bind /tmp /opt

will cause the same mount tree as:

mount --bind /opt /opt
mount --beneath /tmp /opt

because both / and /opt are shared mounts/peers in the same peer
group and the /opt dentry is a subdirectory of both the parent's and
the child's ->mnt_root. If a mount tree like that is created it almost
always is an accident or abuse of mount propagation. Realistically
what most people probably mean in this scenarios is:

mount --bind /opt /opt
mount --make-private /opt
mount --make-shared /opt

This forces the allocation of a new separate peer group for the /opt
mount. Aferwards a mount --bind or mount --beneath actually makes
sense as the / and /opt mount belong to different peer groups. Before
that it's likely just confusion about what the user wanted to achieve.
* Refuse MOVE_MOUNT_BENEATH if:
(1) the @mnt_from has been overmounted in between path resolution and
acquiring @namespace_sem when locking @mnt_to. This avoids the
proliferation of shadow mounts.
(2) if @to_mnt is moved to a different mountpoint while acquiring
@namespace_sem to lock @to_mnt.
(3) if @to_mnt is unmounted while acquiring @namespace_sem to lock
@to_mnt.
(4) if the parent of the target mount propagates to the target mount
at the same mountpoint.
This would mean mounting @mnt_from on @mnt_to->mnt_parent and then
propagating a copy @c of @mnt_from onto @mnt_to. This defeats the
whole purpose of mounting @mnt_from beneath @mnt_to.
(5) if the parent mount @mnt_to->mnt_parent propagates to @mnt_from at
the same mountpoint.
If @mnt_to->mnt_parent propagates to @mnt_from this would mean
propagating a copy @c of @mnt_from on top of @mnt_from. Afterwards
@mnt_from would be mounted on top of @mnt_to->mnt_parent and
@mnt_to would be unmounted from @mnt->mnt_parent and remounted on
@mnt_from. But since @c is already mounted on @mnt_from, @mnt_to
would ultimately be remounted on top of @c. Afterwards, @mnt_from
would be covered by a copy @c of @mnt_from and @c would be covered
by @mnt_from itself. This defeats the whole purpose of mounting
@mnt_from beneath @mnt_to.
Cases (1) to (3) are required as they deal with races that would cause
bugs or unexpected behavior for users. Cases (4) and (5) refuse
semantical quirks that would not be a bug but would cause weird mount
trees to be created. While they can already be created via other means
(mount --bind /opt /opt x n) there's no reason to repeat past mistakes
in new features.

Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [1]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [2]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013

Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Message-Id: <20230202-fs-move-mount-replace-v4-4-98f3d80d7eaa@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 6ac39281 Wed May 03 05:18:42 MDT 2023 Christian Brauner <brauner@kernel.org> fs: allow to mount beneath top mount

Various distributions are adding or are in the process of adding support
for system extensions and in the future configuration extensions through
various tools. A more detailed explanation on system and configuration
extensions can be found on the manpage which is listed below at [1].

System extension images may – dynamically at runtime — extend the /usr/
and /opt/ directory hierarchies with additional files. This is
particularly useful on immutable system images where a /usr/ and/or
/opt/ hierarchy residing on a read-only file system shall be extended
temporarily at runtime without making any persistent modifications.

When one or more system extension images are activated, their /usr/ and
/opt/ hierarchies are combined via overlayfs with the same hierarchies
of the host OS, and the host /usr/ and /opt/ overmounted with it
("merging"). When they are deactivated, the mount point is disassembled
— again revealing the unmodified original host version of the hierarchy
("unmerging"). Merging thus makes the extension's resources suddenly
appear below the /usr/ and /opt/ hierarchies as if they were included in
the base OS image itself. Unmerging makes them disappear again, leaving
in place only the files that were shipped with the base OS image itself.

System configuration images are similar but operate on directories
containing system or service configuration.

On nearly all modern distributions mount propagation plays a crucial
role and the rootfs of the OS is a shared mount in a peer group (usually
with peer group id 1):

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/ / ext4 shared:1 29 1

On such systems all services and containers run in a separate mount
namespace and are pivot_root()ed into their rootfs. A separate mount
namespace is almost always used as it is the minimal isolation mechanism
services have. But usually they are even much more isolated up to the
point where they almost become indistinguishable from containers.

Mount propagation again plays a crucial role here. The rootfs of all
these services is a slave mount to the peer group of the host rootfs.
This is done so the service will receive mount propagation events from
the host when certain files or directories are updated.

In addition, the rootfs of each service, container, and sandbox is also
a shared mount in its separate peer group:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/ / ext4 shared:24 master:1 71 47

For people not too familiar with mount propagation, the master:1 means
that this is a slave mount to peer group 1. Which as one can see is the
host rootfs as indicated by shared:1 above. The shared:24 indicates that
the service rootfs is a shared mount in a separate peer group with peer
group id 24.

A service may run other services. Such nested services will also have a
rootfs mount that is a slave to the peer group of the outer service
rootfs mount.

For containers things are just slighly different. A container's rootfs
isn't a slave to the service's or host rootfs' peer group. The rootfs
mount of a container is simply a shared mount in its own peer group:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/home/ubuntu/debian-tree / ext4 shared:99 61 60

So whereas services are isolated OS components a container is treated
like a separate world and mount propagation into it is restricted to a
single well known mount that is a slave to the peer group of the shared
mount /run on the host:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/propagate/debian-tree /run/host/incoming tmpfs master:5 71 68

Here, the master:5 indicates that this mount is a slave to the peer
group with peer group id 5. This allows to propagate mounts into the
container and served as a workaround for not being able to insert mounts
into mount namespaces directly. But the new mount api does support
inserting mounts directly. For the interested reader the blogpost in [2]
might be worth reading where I explain the old and the new approach to
inserting mounts into mount namespaces.

Containers of course, can themselves be run as services. They often run
full systems themselves which means they again run services and
containers with the exact same propagation settings explained above.

The whole system is designed so that it can be easily updated, including
all services in various fine-grained ways without having to enter every
single service's mount namespace which would be prohibitively expensive.
The mount propagation layout has been carefully chosen so it is possible
to propagate updates for system extensions and configurations from the
host into all services.

The simplest model to update the whole system is to mount on top of
/usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
will then propagate into every service. This works cleanly the first
time. However, when the system is updated multiple times it becomes
necessary to unmount the first update on /opt, /usr, /etc and then
propagate the new update. But this means, there's an interval where the
old base system is accessible. This has to be avoided to protect against
downgrade attacks.

The vfs already exposes a mechanism to userspace whereby mounts can be
mounted beneath an existing mount. Such mounts are internally referred
to as "tucked". The patch series exposes the ability to mount beneath a
top mount through the new MOVE_MOUNT_BENEATH flag for the move_mount()
system call. This allows userspace to seamlessly upgrade mounts. After
this series the only thing that will have changed is that mounting
beneath an existing mount can be done explicitly instead of just
implicitly.

Today, there are two scenarios where a mount can be mounted beneath an
existing mount instead of on top of it:

(1) When a service or container is started in a new mount namespace and
pivot_root()s into its new rootfs. The way this is done is by
mounting the new rootfs beneath the old rootfs:

fd_newroot = open("/var/lib/machines/fedora", ...);
fd_oldroot = open("/", ...);
fchdir(fd_newroot);
pivot_root(".", ".");

After the pivot_root(".", ".") call the new rootfs is mounted
beneath the old rootfs which can then be unmounted to reveal the
underlying mount:

fchdir(fd_oldroot);
umount2(".", MNT_DETACH);

Since pivot_root() moves the caller into a new rootfs no mounts must
be propagated out of the new rootfs as a consequence of the
pivot_root() call. Thus, the mounts cannot be shared.

(2) When a mount is propagated to a mount that already has another mount
mounted on the same dentry.

The easiest example for this is to create a new mount namespace. The
following commands will create a mount namespace where the rootfs
mount / will be a slave to the peer group of the host rootfs /
mount's peer group. IOW, it will receive propagation from the host:

mount --make-shared /
unshare --mount --propagation=slave

Now a new mount on the /mnt dentry in that mount namespace is
created. (As it can be confusing it should be spelled out that the
tmpfs mount on the /mnt dentry that was just created doesn't
propagate back to the host because the rootfs mount / of the mount
namespace isn't a peer of the host rootfs.):

mount -t tmpfs tmpfs /mnt

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt tmpfs tmpfs

Now another terminal in the host mount namespace can observe that
the mount indeed hasn't propagated back to into the host mount
namespace. A new mount can now be created on top of the /mnt dentry
with the rootfs mount / as its parent:

mount --bind /opt /mnt

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt /dev/sda2[/opt] ext4 shared:1

The mount namespace that was created earlier can now observe that
the bind mount created on the host has propagated into it:

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt /dev/sda2[/opt] ext4 master:1
└─/mnt tmpfs tmpfs

But instead of having been mounted on top of the tmpfs mount at the
/mnt dentry the /opt mount has been mounted on top of the rootfs
mount at the /mnt dentry. And the tmpfs mount has been remounted on
top of the propagated /opt mount at the /opt dentry. So in other
words, the propagated mount has been mounted beneath the preexisting
mount in that mount namespace.

Mount namespaces make this easy to illustrate but it's also easy to
mount beneath an existing mount in the same mount namespace
(The following example assumes a shared rootfs mount / with peer
group id 1):

mount --bind /opt /opt

TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION
└─/opt /dev/sda2[/opt] ext4 188 29 shared:1

If another mount is mounted on top of the /opt mount at the /opt
dentry:

mount --bind /tmp /opt

The following clunky mount tree will result:

TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION
└─/opt /dev/sda2[/tmp] ext4 405 29 shared:1
└─/opt /dev/sda2[/opt] ext4 188 405 shared:1
└─/opt /dev/sda2[/tmp] ext4 404 188 shared:1

The /tmp mount is mounted beneath the /opt mount and another copy is
mounted on top of the /opt mount. This happens because the rootfs /
and the /opt mount are shared mounts in the same peer group.

When the new /tmp mount is supposed to be mounted at the /opt dentry
then the /tmp mount first propagates to the root mount at the /opt
dentry. But there already is the /opt mount mounted at the /opt
dentry. So the old /opt mount at the /opt dentry will be mounted on
top of the new /tmp mount at the /tmp dentry, i.e. @opt->mnt_parent
is @tmp and @opt->mnt_mountpoint is /tmp (Note that @opt->mnt_root
is /opt which is what shows up as /opt under SOURCE). So again, a
mount will be mounted beneath a preexisting mount.

(Fwiw, a few iterations of mount --bind /opt /opt in a loop on a
shared rootfs is a good example of what could be referred to as
mount explosion.)

The main point is that such mounts allows userspace to umount a top
mount and reveal an underlying mount. So for example, umounting the
tmpfs mount on /mnt that was created in example (1) using mount
namespaces reveals the /opt mount which was mounted beneath it.

In (2) where a mount was mounted beneath the top mount in the same mount
namespace unmounting the top mount would unmount both the top mount and
the mount beneath. In the process the original mount would be remounted
on top of the rootfs mount / at the /opt dentry again.

This again, is a result of mount propagation only this time it's umount
propagation. However, this can be avoided by simply making the parent
mount / of the @opt mount a private or slave mount. Then the top mount
and the original mount can be unmounted to reveal the mount beneath.

These two examples are fairly arcane and are merely added to make it
clear how mount propagation has effects on current and future features.

More common use-cases will just be things like:

mount -t btrfs /dev/sdA /mnt
mount -t xfs /dev/sdB --beneath /mnt
umount /mnt

after which we'll have updated from a btrfs filesystem to a xfs
filesystem without ever revealing the underlying mountpoint.

The crux is that the proposed mechanism already exists and that it is so
powerful as to cover cases where mounts are supposed to be updated with
new versions. Crucially, it offers an important flexibility. Namely that
updates to a system may either be forced or can be delayed and the
umount of the top mount be left to a service if it is a cooperative one.

This adds a new flag to move_mount() that allows to explicitly move a
beneath the top mount adhering to the following semantics:

* Mounts cannot be mounted beneath the rootfs. This restriction
encompasses the rootfs but also chroots via chroot() and pivot_root().
To mount a mount beneath the rootfs or a chroot, pivot_root() can be
used as illustrated above.
* The source mount must be a private mount to force the kernel to
allocate a new, unused peer group id. This isn't a required
restriction but a voluntary one. It avoids repeating a semantical
quirk that already exists today. If bind mounts which already have a
peer group id are inserted into mount trees that have the same peer
group id this can cause a lot of mount propagation events to be
generated (For example, consider running mount --bind /opt /opt in a
loop where the parent mount is a shared mount.).
* Avoid getting rid of the top mount in the kernel. Cooperative services
need to be able to unmount the top mount themselves.
This also avoids a good deal of additional complexity. The umount
would have to be propagated which would be another rather expensive
operation. So namespace_lock() and lock_mount_hash() would potentially
have to be held for a long time for both a mount and umount
propagation. That should be avoided.
* The path to mount beneath must be mounted and attached.
* The top mount and its parent must be in the caller's mount namespace
and the caller must be able to mount in that mount namespace.
* The caller must be able to unmount the top mount to prove that they
could reveal the underlying mount.
* The propagation tree is calculated based on the destination mount's
parent mount and the destination mount's mountpoint on the parent
mount. Of course, if the parent of the destination mount and the
destination mount are shared mounts in the same peer group and the
mountpoint of the new mount to be mounted is a subdir of their
->mnt_root then both will receive a mount of /opt. That's probably
easier to understand with an example. Assuming a standard shared
rootfs /:

mount --bind /opt /opt
mount --bind /tmp /opt

will cause the same mount tree as:

mount --bind /opt /opt
mount --beneath /tmp /opt

because both / and /opt are shared mounts/peers in the same peer
group and the /opt dentry is a subdirectory of both the parent's and
the child's ->mnt_root. If a mount tree like that is created it almost
always is an accident or abuse of mount propagation. Realistically
what most people probably mean in this scenarios is:

mount --bind /opt /opt
mount --make-private /opt
mount --make-shared /opt

This forces the allocation of a new separate peer group for the /opt
mount. Aferwards a mount --bind or mount --beneath actually makes
sense as the / and /opt mount belong to different peer groups. Before
that it's likely just confusion about what the user wanted to achieve.
* Refuse MOVE_MOUNT_BENEATH if:
(1) the @mnt_from has been overmounted in between path resolution and
acquiring @namespace_sem when locking @mnt_to. This avoids the
proliferation of shadow mounts.
(2) if @to_mnt is moved to a different mountpoint while acquiring
@namespace_sem to lock @to_mnt.
(3) if @to_mnt is unmounted while acquiring @namespace_sem to lock
@to_mnt.
(4) if the parent of the target mount propagates to the target mount
at the same mountpoint.
This would mean mounting @mnt_from on @mnt_to->mnt_parent and then
propagating a copy @c of @mnt_from onto @mnt_to. This defeats the
whole purpose of mounting @mnt_from beneath @mnt_to.
(5) if the parent mount @mnt_to->mnt_parent propagates to @mnt_from at
the same mountpoint.
If @mnt_to->mnt_parent propagates to @mnt_from this would mean
propagating a copy @c of @mnt_from on top of @mnt_from. Afterwards
@mnt_from would be mounted on top of @mnt_to->mnt_parent and
@mnt_to would be unmounted from @mnt->mnt_parent and remounted on
@mnt_from. But since @c is already mounted on @mnt_from, @mnt_to
would ultimately be remounted on top of @c. Afterwards, @mnt_from
would be covered by a copy @c of @mnt_from and @c would be covered
by @mnt_from itself. This defeats the whole purpose of mounting
@mnt_from beneath @mnt_to.
Cases (1) to (3) are required as they deal with races that would cause
bugs or unexpected behavior for users. Cases (4) and (5) refuse
semantical quirks that would not be a bug but would cause weird mount
trees to be created. While they can already be created via other means
(mount --bind /opt /opt x n) there's no reason to repeat past mistakes
in new features.

Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [1]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [2]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013

Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Message-Id: <20230202-fs-move-mount-replace-v4-4-98f3d80d7eaa@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 6ac39281 Wed May 03 05:18:42 MDT 2023 Christian Brauner <brauner@kernel.org> fs: allow to mount beneath top mount

Various distributions are adding or are in the process of adding support
for system extensions and in the future configuration extensions through
various tools. A more detailed explanation on system and configuration
extensions can be found on the manpage which is listed below at [1].

System extension images may – dynamically at runtime — extend the /usr/
and /opt/ directory hierarchies with additional files. This is
particularly useful on immutable system images where a /usr/ and/or
/opt/ hierarchy residing on a read-only file system shall be extended
temporarily at runtime without making any persistent modifications.

When one or more system extension images are activated, their /usr/ and
/opt/ hierarchies are combined via overlayfs with the same hierarchies
of the host OS, and the host /usr/ and /opt/ overmounted with it
("merging"). When they are deactivated, the mount point is disassembled
— again revealing the unmodified original host version of the hierarchy
("unmerging"). Merging thus makes the extension's resources suddenly
appear below the /usr/ and /opt/ hierarchies as if they were included in
the base OS image itself. Unmerging makes them disappear again, leaving
in place only the files that were shipped with the base OS image itself.

System configuration images are similar but operate on directories
containing system or service configuration.

On nearly all modern distributions mount propagation plays a crucial
role and the rootfs of the OS is a shared mount in a peer group (usually
with peer group id 1):

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/ / ext4 shared:1 29 1

On such systems all services and containers run in a separate mount
namespace and are pivot_root()ed into their rootfs. A separate mount
namespace is almost always used as it is the minimal isolation mechanism
services have. But usually they are even much more isolated up to the
point where they almost become indistinguishable from containers.

Mount propagation again plays a crucial role here. The rootfs of all
these services is a slave mount to the peer group of the host rootfs.
This is done so the service will receive mount propagation events from
the host when certain files or directories are updated.

In addition, the rootfs of each service, container, and sandbox is also
a shared mount in its separate peer group:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/ / ext4 shared:24 master:1 71 47

For people not too familiar with mount propagation, the master:1 means
that this is a slave mount to peer group 1. Which as one can see is the
host rootfs as indicated by shared:1 above. The shared:24 indicates that
the service rootfs is a shared mount in a separate peer group with peer
group id 24.

A service may run other services. Such nested services will also have a
rootfs mount that is a slave to the peer group of the outer service
rootfs mount.

For containers things are just slighly different. A container's rootfs
isn't a slave to the service's or host rootfs' peer group. The rootfs
mount of a container is simply a shared mount in its own peer group:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/home/ubuntu/debian-tree / ext4 shared:99 61 60

So whereas services are isolated OS components a container is treated
like a separate world and mount propagation into it is restricted to a
single well known mount that is a slave to the peer group of the shared
mount /run on the host:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/propagate/debian-tree /run/host/incoming tmpfs master:5 71 68

Here, the master:5 indicates that this mount is a slave to the peer
group with peer group id 5. This allows to propagate mounts into the
container and served as a workaround for not being able to insert mounts
into mount namespaces directly. But the new mount api does support
inserting mounts directly. For the interested reader the blogpost in [2]
might be worth reading where I explain the old and the new approach to
inserting mounts into mount namespaces.

Containers of course, can themselves be run as services. They often run
full systems themselves which means they again run services and
containers with the exact same propagation settings explained above.

The whole system is designed so that it can be easily updated, including
all services in various fine-grained ways without having to enter every
single service's mount namespace which would be prohibitively expensive.
The mount propagation layout has been carefully chosen so it is possible
to propagate updates for system extensions and configurations from the
host into all services.

The simplest model to update the whole system is to mount on top of
/usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
will then propagate into every service. This works cleanly the first
time. However, when the system is updated multiple times it becomes
necessary to unmount the first update on /opt, /usr, /etc and then
propagate the new update. But this means, there's an interval where the
old base system is accessible. This has to be avoided to protect against
downgrade attacks.

The vfs already exposes a mechanism to userspace whereby mounts can be
mounted beneath an existing mount. Such mounts are internally referred
to as "tucked". The patch series exposes the ability to mount beneath a
top mount through the new MOVE_MOUNT_BENEATH flag for the move_mount()
system call. This allows userspace to seamlessly upgrade mounts. After
this series the only thing that will have changed is that mounting
beneath an existing mount can be done explicitly instead of just
implicitly.

Today, there are two scenarios where a mount can be mounted beneath an
existing mount instead of on top of it:

(1) When a service or container is started in a new mount namespace and
pivot_root()s into its new rootfs. The way this is done is by
mounting the new rootfs beneath the old rootfs:

fd_newroot = open("/var/lib/machines/fedora", ...);
fd_oldroot = open("/", ...);
fchdir(fd_newroot);
pivot_root(".", ".");

After the pivot_root(".", ".") call the new rootfs is mounted
beneath the old rootfs which can then be unmounted to reveal the
underlying mount:

fchdir(fd_oldroot);
umount2(".", MNT_DETACH);

Since pivot_root() moves the caller into a new rootfs no mounts must
be propagated out of the new rootfs as a consequence of the
pivot_root() call. Thus, the mounts cannot be shared.

(2) When a mount is propagated to a mount that already has another mount
mounted on the same dentry.

The easiest example for this is to create a new mount namespace. The
following commands will create a mount namespace where the rootfs
mount / will be a slave to the peer group of the host rootfs /
mount's peer group. IOW, it will receive propagation from the host:

mount --make-shared /
unshare --mount --propagation=slave

Now a new mount on the /mnt dentry in that mount namespace is
created. (As it can be confusing it should be spelled out that the
tmpfs mount on the /mnt dentry that was just created doesn't
propagate back to the host because the rootfs mount / of the mount
namespace isn't a peer of the host rootfs.):

mount -t tmpfs tmpfs /mnt

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt tmpfs tmpfs

Now another terminal in the host mount namespace can observe that
the mount indeed hasn't propagated back to into the host mount
namespace. A new mount can now be created on top of the /mnt dentry
with the rootfs mount / as its parent:

mount --bind /opt /mnt

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt /dev/sda2[/opt] ext4 shared:1

The mount namespace that was created earlier can now observe that
the bind mount created on the host has propagated into it:

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt /dev/sda2[/opt] ext4 master:1
└─/mnt tmpfs tmpfs

But instead of having been mounted on top of the tmpfs mount at the
/mnt dentry the /opt mount has been mounted on top of the rootfs
mount at the /mnt dentry. And the tmpfs mount has been remounted on
top of the propagated /opt mount at the /opt dentry. So in other
words, the propagated mount has been mounted beneath the preexisting
mount in that mount namespace.

Mount namespaces make this easy to illustrate but it's also easy to
mount beneath an existing mount in the same mount namespace
(The following example assumes a shared rootfs mount / with peer
group id 1):

mount --bind /opt /opt

TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION
└─/opt /dev/sda2[/opt] ext4 188 29 shared:1

If another mount is mounted on top of the /opt mount at the /opt
dentry:

mount --bind /tmp /opt

The following clunky mount tree will result:

TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION
└─/opt /dev/sda2[/tmp] ext4 405 29 shared:1
└─/opt /dev/sda2[/opt] ext4 188 405 shared:1
└─/opt /dev/sda2[/tmp] ext4 404 188 shared:1

The /tmp mount is mounted beneath the /opt mount and another copy is
mounted on top of the /opt mount. This happens because the rootfs /
and the /opt mount are shared mounts in the same peer group.

When the new /tmp mount is supposed to be mounted at the /opt dentry
then the /tmp mount first propagates to the root mount at the /opt
dentry. But there already is the /opt mount mounted at the /opt
dentry. So the old /opt mount at the /opt dentry will be mounted on
top of the new /tmp mount at the /tmp dentry, i.e. @opt->mnt_parent
is @tmp and @opt->mnt_mountpoint is /tmp (Note that @opt->mnt_root
is /opt which is what shows up as /opt under SOURCE). So again, a
mount will be mounted beneath a preexisting mount.

(Fwiw, a few iterations of mount --bind /opt /opt in a loop on a
shared rootfs is a good example of what could be referred to as
mount explosion.)

The main point is that such mounts allows userspace to umount a top
mount and reveal an underlying mount. So for example, umounting the
tmpfs mount on /mnt that was created in example (1) using mount
namespaces reveals the /opt mount which was mounted beneath it.

In (2) where a mount was mounted beneath the top mount in the same mount
namespace unmounting the top mount would unmount both the top mount and
the mount beneath. In the process the original mount would be remounted
on top of the rootfs mount / at the /opt dentry again.

This again, is a result of mount propagation only this time it's umount
propagation. However, this can be avoided by simply making the parent
mount / of the @opt mount a private or slave mount. Then the top mount
and the original mount can be unmounted to reveal the mount beneath.

These two examples are fairly arcane and are merely added to make it
clear how mount propagation has effects on current and future features.

More common use-cases will just be things like:

mount -t btrfs /dev/sdA /mnt
mount -t xfs /dev/sdB --beneath /mnt
umount /mnt

after which we'll have updated from a btrfs filesystem to a xfs
filesystem without ever revealing the underlying mountpoint.

The crux is that the proposed mechanism already exists and that it is so
powerful as to cover cases where mounts are supposed to be updated with
new versions. Crucially, it offers an important flexibility. Namely that
updates to a system may either be forced or can be delayed and the
umount of the top mount be left to a service if it is a cooperative one.

This adds a new flag to move_mount() that allows to explicitly move a
beneath the top mount adhering to the following semantics:

* Mounts cannot be mounted beneath the rootfs. This restriction
encompasses the rootfs but also chroots via chroot() and pivot_root().
To mount a mount beneath the rootfs or a chroot, pivot_root() can be
used as illustrated above.
* The source mount must be a private mount to force the kernel to
allocate a new, unused peer group id. This isn't a required
restriction but a voluntary one. It avoids repeating a semantical
quirk that already exists today. If bind mounts which already have a
peer group id are inserted into mount trees that have the same peer
group id this can cause a lot of mount propagation events to be
generated (For example, consider running mount --bind /opt /opt in a
loop where the parent mount is a shared mount.).
* Avoid getting rid of the top mount in the kernel. Cooperative services
need to be able to unmount the top mount themselves.
This also avoids a good deal of additional complexity. The umount
would have to be propagated which would be another rather expensive
operation. So namespace_lock() and lock_mount_hash() would potentially
have to be held for a long time for both a mount and umount
propagation. That should be avoided.
* The path to mount beneath must be mounted and attached.
* The top mount and its parent must be in the caller's mount namespace
and the caller must be able to mount in that mount namespace.
* The caller must be able to unmount the top mount to prove that they
could reveal the underlying mount.
* The propagation tree is calculated based on the destination mount's
parent mount and the destination mount's mountpoint on the parent
mount. Of course, if the parent of the destination mount and the
destination mount are shared mounts in the same peer group and the
mountpoint of the new mount to be mounted is a subdir of their
->mnt_root then both will receive a mount of /opt. That's probably
easier to understand with an example. Assuming a standard shared
rootfs /:

mount --bind /opt /opt
mount --bind /tmp /opt

will cause the same mount tree as:

mount --bind /opt /opt
mount --beneath /tmp /opt

because both / and /opt are shared mounts/peers in the same peer
group and the /opt dentry is a subdirectory of both the parent's and
the child's ->mnt_root. If a mount tree like that is created it almost
always is an accident or abuse of mount propagation. Realistically
what most people probably mean in this scenarios is:

mount --bind /opt /opt
mount --make-private /opt
mount --make-shared /opt

This forces the allocation of a new separate peer group for the /opt
mount. Aferwards a mount --bind or mount --beneath actually makes
sense as the / and /opt mount belong to different peer groups. Before
that it's likely just confusion about what the user wanted to achieve.
* Refuse MOVE_MOUNT_BENEATH if:
(1) the @mnt_from has been overmounted in between path resolution and
acquiring @namespace_sem when locking @mnt_to. This avoids the
proliferation of shadow mounts.
(2) if @to_mnt is moved to a different mountpoint while acquiring
@namespace_sem to lock @to_mnt.
(3) if @to_mnt is unmounted while acquiring @namespace_sem to lock
@to_mnt.
(4) if the parent of the target mount propagates to the target mount
at the same mountpoint.
This would mean mounting @mnt_from on @mnt_to->mnt_parent and then
propagating a copy @c of @mnt_from onto @mnt_to. This defeats the
whole purpose of mounting @mnt_from beneath @mnt_to.
(5) if the parent mount @mnt_to->mnt_parent propagates to @mnt_from at
the same mountpoint.
If @mnt_to->mnt_parent propagates to @mnt_from this would mean
propagating a copy @c of @mnt_from on top of @mnt_from. Afterwards
@mnt_from would be mounted on top of @mnt_to->mnt_parent and
@mnt_to would be unmounted from @mnt->mnt_parent and remounted on
@mnt_from. But since @c is already mounted on @mnt_from, @mnt_to
would ultimately be remounted on top of @c. Afterwards, @mnt_from
would be covered by a copy @c of @mnt_from and @c would be covered
by @mnt_from itself. This defeats the whole purpose of mounting
@mnt_from beneath @mnt_to.
Cases (1) to (3) are required as they deal with races that would cause
bugs or unexpected behavior for users. Cases (4) and (5) refuse
semantical quirks that would not be a bug but would cause weird mount
trees to be created. While they can already be created via other means
(mount --bind /opt /opt x n) there's no reason to repeat past mistakes
in new features.

Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [1]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [2]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013

Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Message-Id: <20230202-fs-move-mount-replace-v4-4-98f3d80d7eaa@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 6ac39281 Wed May 03 05:18:42 MDT 2023 Christian Brauner <brauner@kernel.org> fs: allow to mount beneath top mount

Various distributions are adding or are in the process of adding support
for system extensions and in the future configuration extensions through
various tools. A more detailed explanation on system and configuration
extensions can be found on the manpage which is listed below at [1].

System extension images may – dynamically at runtime — extend the /usr/
and /opt/ directory hierarchies with additional files. This is
particularly useful on immutable system images where a /usr/ and/or
/opt/ hierarchy residing on a read-only file system shall be extended
temporarily at runtime without making any persistent modifications.

When one or more system extension images are activated, their /usr/ and
/opt/ hierarchies are combined via overlayfs with the same hierarchies
of the host OS, and the host /usr/ and /opt/ overmounted with it
("merging"). When they are deactivated, the mount point is disassembled
— again revealing the unmodified original host version of the hierarchy
("unmerging"). Merging thus makes the extension's resources suddenly
appear below the /usr/ and /opt/ hierarchies as if they were included in
the base OS image itself. Unmerging makes them disappear again, leaving
in place only the files that were shipped with the base OS image itself.

System configuration images are similar but operate on directories
containing system or service configuration.

On nearly all modern distributions mount propagation plays a crucial
role and the rootfs of the OS is a shared mount in a peer group (usually
with peer group id 1):

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/ / ext4 shared:1 29 1

On such systems all services and containers run in a separate mount
namespace and are pivot_root()ed into their rootfs. A separate mount
namespace is almost always used as it is the minimal isolation mechanism
services have. But usually they are even much more isolated up to the
point where they almost become indistinguishable from containers.

Mount propagation again plays a crucial role here. The rootfs of all
these services is a slave mount to the peer group of the host rootfs.
This is done so the service will receive mount propagation events from
the host when certain files or directories are updated.

In addition, the rootfs of each service, container, and sandbox is also
a shared mount in its separate peer group:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/ / ext4 shared:24 master:1 71 47

For people not too familiar with mount propagation, the master:1 means
that this is a slave mount to peer group 1. Which as one can see is the
host rootfs as indicated by shared:1 above. The shared:24 indicates that
the service rootfs is a shared mount in a separate peer group with peer
group id 24.

A service may run other services. Such nested services will also have a
rootfs mount that is a slave to the peer group of the outer service
rootfs mount.

For containers things are just slighly different. A container's rootfs
isn't a slave to the service's or host rootfs' peer group. The rootfs
mount of a container is simply a shared mount in its own peer group:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/home/ubuntu/debian-tree / ext4 shared:99 61 60

So whereas services are isolated OS components a container is treated
like a separate world and mount propagation into it is restricted to a
single well known mount that is a slave to the peer group of the shared
mount /run on the host:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/propagate/debian-tree /run/host/incoming tmpfs master:5 71 68

Here, the master:5 indicates that this mount is a slave to the peer
group with peer group id 5. This allows to propagate mounts into the
container and served as a workaround for not being able to insert mounts
into mount namespaces directly. But the new mount api does support
inserting mounts directly. For the interested reader the blogpost in [2]
might be worth reading where I explain the old and the new approach to
inserting mounts into mount namespaces.

Containers of course, can themselves be run as services. They often run
full systems themselves which means they again run services and
containers with the exact same propagation settings explained above.

The whole system is designed so that it can be easily updated, including
all services in various fine-grained ways without having to enter every
single service's mount namespace which would be prohibitively expensive.
The mount propagation layout has been carefully chosen so it is possible
to propagate updates for system extensions and configurations from the
host into all services.

The simplest model to update the whole system is to mount on top of
/usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
will then propagate into every service. This works cleanly the first
time. However, when the system is updated multiple times it becomes
necessary to unmount the first update on /opt, /usr, /etc and then
propagate the new update. But this means, there's an interval where the
old base system is accessible. This has to be avoided to protect against
downgrade attacks.

The vfs already exposes a mechanism to userspace whereby mounts can be
mounted beneath an existing mount. Such mounts are internally referred
to as "tucked". The patch series exposes the ability to mount beneath a
top mount through the new MOVE_MOUNT_BENEATH flag for the move_mount()
system call. This allows userspace to seamlessly upgrade mounts. After
this series the only thing that will have changed is that mounting
beneath an existing mount can be done explicitly instead of just
implicitly.

Today, there are two scenarios where a mount can be mounted beneath an
existing mount instead of on top of it:

(1) When a service or container is started in a new mount namespace and
pivot_root()s into its new rootfs. The way this is done is by
mounting the new rootfs beneath the old rootfs:

fd_newroot = open("/var/lib/machines/fedora", ...);
fd_oldroot = open("/", ...);
fchdir(fd_newroot);
pivot_root(".", ".");

After the pivot_root(".", ".") call the new rootfs is mounted
beneath the old rootfs which can then be unmounted to reveal the
underlying mount:

fchdir(fd_oldroot);
umount2(".", MNT_DETACH);

Since pivot_root() moves the caller into a new rootfs no mounts must
be propagated out of the new rootfs as a consequence of the
pivot_root() call. Thus, the mounts cannot be shared.

(2) When a mount is propagated to a mount that already has another mount
mounted on the same dentry.

The easiest example for this is to create a new mount namespace. The
following commands will create a mount namespace where the rootfs
mount / will be a slave to the peer group of the host rootfs /
mount's peer group. IOW, it will receive propagation from the host:

mount --make-shared /
unshare --mount --propagation=slave

Now a new mount on the /mnt dentry in that mount namespace is
created. (As it can be confusing it should be spelled out that the
tmpfs mount on the /mnt dentry that was just created doesn't
propagate back to the host because the rootfs mount / of the mount
namespace isn't a peer of the host rootfs.):

mount -t tmpfs tmpfs /mnt

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt tmpfs tmpfs

Now another terminal in the host mount namespace can observe that
the mount indeed hasn't propagated back to into the host mount
namespace. A new mount can now be created on top of the /mnt dentry
with the rootfs mount / as its parent:

mount --bind /opt /mnt

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt /dev/sda2[/opt] ext4 shared:1

The mount namespace that was created earlier can now observe that
the bind mount created on the host has propagated into it:

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt /dev/sda2[/opt] ext4 master:1
└─/mnt tmpfs tmpfs

But instead of having been mounted on top of the tmpfs mount at the
/mnt dentry the /opt mount has been mounted on top of the rootfs
mount at the /mnt dentry. And the tmpfs mount has been remounted on
top of the propagated /opt mount at the /opt dentry. So in other
words, the propagated mount has been mounted beneath the preexisting
mount in that mount namespace.

Mount namespaces make this easy to illustrate but it's also easy to
mount beneath an existing mount in the same mount namespace
(The following example assumes a shared rootfs mount / with peer
group id 1):

mount --bind /opt /opt

TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION
└─/opt /dev/sda2[/opt] ext4 188 29 shared:1

If another mount is mounted on top of the /opt mount at the /opt
dentry:

mount --bind /tmp /opt

The following clunky mount tree will result:

TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION
└─/opt /dev/sda2[/tmp] ext4 405 29 shared:1
└─/opt /dev/sda2[/opt] ext4 188 405 shared:1
└─/opt /dev/sda2[/tmp] ext4 404 188 shared:1

The /tmp mount is mounted beneath the /opt mount and another copy is
mounted on top of the /opt mount. This happens because the rootfs /
and the /opt mount are shared mounts in the same peer group.

When the new /tmp mount is supposed to be mounted at the /opt dentry
then the /tmp mount first propagates to the root mount at the /opt
dentry. But there already is the /opt mount mounted at the /opt
dentry. So the old /opt mount at the /opt dentry will be mounted on
top of the new /tmp mount at the /tmp dentry, i.e. @opt->mnt_parent
is @tmp and @opt->mnt_mountpoint is /tmp (Note that @opt->mnt_root
is /opt which is what shows up as /opt under SOURCE). So again, a
mount will be mounted beneath a preexisting mount.

(Fwiw, a few iterations of mount --bind /opt /opt in a loop on a
shared rootfs is a good example of what could be referred to as
mount explosion.)

The main point is that such mounts allows userspace to umount a top
mount and reveal an underlying mount. So for example, umounting the
tmpfs mount on /mnt that was created in example (1) using mount
namespaces reveals the /opt mount which was mounted beneath it.

In (2) where a mount was mounted beneath the top mount in the same mount
namespace unmounting the top mount would unmount both the top mount and
the mount beneath. In the process the original mount would be remounted
on top of the rootfs mount / at the /opt dentry again.

This again, is a result of mount propagation only this time it's umount
propagation. However, this can be avoided by simply making the parent
mount / of the @opt mount a private or slave mount. Then the top mount
and the original mount can be unmounted to reveal the mount beneath.

These two examples are fairly arcane and are merely added to make it
clear how mount propagation has effects on current and future features.

More common use-cases will just be things like:

mount -t btrfs /dev/sdA /mnt
mount -t xfs /dev/sdB --beneath /mnt
umount /mnt

after which we'll have updated from a btrfs filesystem to a xfs
filesystem without ever revealing the underlying mountpoint.

The crux is that the proposed mechanism already exists and that it is so
powerful as to cover cases where mounts are supposed to be updated with
new versions. Crucially, it offers an important flexibility. Namely that
updates to a system may either be forced or can be delayed and the
umount of the top mount be left to a service if it is a cooperative one.

This adds a new flag to move_mount() that allows to explicitly move a
beneath the top mount adhering to the following semantics:

* Mounts cannot be mounted beneath the rootfs. This restriction
encompasses the rootfs but also chroots via chroot() and pivot_root().
To mount a mount beneath the rootfs or a chroot, pivot_root() can be
used as illustrated above.
* The source mount must be a private mount to force the kernel to
allocate a new, unused peer group id. This isn't a required
restriction but a voluntary one. It avoids repeating a semantical
quirk that already exists today. If bind mounts which already have a
peer group id are inserted into mount trees that have the same peer
group id this can cause a lot of mount propagation events to be
generated (For example, consider running mount --bind /opt /opt in a
loop where the parent mount is a shared mount.).
* Avoid getting rid of the top mount in the kernel. Cooperative services
need to be able to unmount the top mount themselves.
This also avoids a good deal of additional complexity. The umount
would have to be propagated which would be another rather expensive
operation. So namespace_lock() and lock_mount_hash() would potentially
have to be held for a long time for both a mount and umount
propagation. That should be avoided.
* The path to mount beneath must be mounted and attached.
* The top mount and its parent must be in the caller's mount namespace
and the caller must be able to mount in that mount namespace.
* The caller must be able to unmount the top mount to prove that they
could reveal the underlying mount.
* The propagation tree is calculated based on the destination mount's
parent mount and the destination mount's mountpoint on the parent
mount. Of course, if the parent of the destination mount and the
destination mount are shared mounts in the same peer group and the
mountpoint of the new mount to be mounted is a subdir of their
->mnt_root then both will receive a mount of /opt. That's probably
easier to understand with an example. Assuming a standard shared
rootfs /:

mount --bind /opt /opt
mount --bind /tmp /opt

will cause the same mount tree as:

mount --bind /opt /opt
mount --beneath /tmp /opt

because both / and /opt are shared mounts/peers in the same peer
group and the /opt dentry is a subdirectory of both the parent's and
the child's ->mnt_root. If a mount tree like that is created it almost
always is an accident or abuse of mount propagation. Realistically
what most people probably mean in this scenarios is:

mount --bind /opt /opt
mount --make-private /opt
mount --make-shared /opt

This forces the allocation of a new separate peer group for the /opt
mount. Aferwards a mount --bind or mount --beneath actually makes
sense as the / and /opt mount belong to different peer groups. Before
that it's likely just confusion about what the user wanted to achieve.
* Refuse MOVE_MOUNT_BENEATH if:
(1) the @mnt_from has been overmounted in between path resolution and
acquiring @namespace_sem when locking @mnt_to. This avoids the
proliferation of shadow mounts.
(2) if @to_mnt is moved to a different mountpoint while acquiring
@namespace_sem to lock @to_mnt.
(3) if @to_mnt is unmounted while acquiring @namespace_sem to lock
@to_mnt.
(4) if the parent of the target mount propagates to the target mount
at the same mountpoint.
This would mean mounting @mnt_from on @mnt_to->mnt_parent and then
propagating a copy @c of @mnt_from onto @mnt_to. This defeats the
whole purpose of mounting @mnt_from beneath @mnt_to.
(5) if the parent mount @mnt_to->mnt_parent propagates to @mnt_from at
the same mountpoint.
If @mnt_to->mnt_parent propagates to @mnt_from this would mean
propagating a copy @c of @mnt_from on top of @mnt_from. Afterwards
@mnt_from would be mounted on top of @mnt_to->mnt_parent and
@mnt_to would be unmounted from @mnt->mnt_parent and remounted on
@mnt_from. But since @c is already mounted on @mnt_from, @mnt_to
would ultimately be remounted on top of @c. Afterwards, @mnt_from
would be covered by a copy @c of @mnt_from and @c would be covered
by @mnt_from itself. This defeats the whole purpose of mounting
@mnt_from beneath @mnt_to.
Cases (1) to (3) are required as they deal with races that would cause
bugs or unexpected behavior for users. Cases (4) and (5) refuse
semantical quirks that would not be a bug but would cause weird mount
trees to be created. While they can already be created via other means
(mount --bind /opt /opt x n) there's no reason to repeat past mistakes
in new features.

Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [1]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [2]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013

Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Message-Id: <20230202-fs-move-mount-replace-v4-4-98f3d80d7eaa@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 6ac39281 Wed May 03 05:18:42 MDT 2023 Christian Brauner <brauner@kernel.org> fs: allow to mount beneath top mount

Various distributions are adding or are in the process of adding support
for system extensions and in the future configuration extensions through
various tools. A more detailed explanation on system and configuration
extensions can be found on the manpage which is listed below at [1].

System extension images may – dynamically at runtime — extend the /usr/
and /opt/ directory hierarchies with additional files. This is
particularly useful on immutable system images where a /usr/ and/or
/opt/ hierarchy residing on a read-only file system shall be extended
temporarily at runtime without making any persistent modifications.

When one or more system extension images are activated, their /usr/ and
/opt/ hierarchies are combined via overlayfs with the same hierarchies
of the host OS, and the host /usr/ and /opt/ overmounted with it
("merging"). When they are deactivated, the mount point is disassembled
— again revealing the unmodified original host version of the hierarchy
("unmerging"). Merging thus makes the extension's resources suddenly
appear below the /usr/ and /opt/ hierarchies as if they were included in
the base OS image itself. Unmerging makes them disappear again, leaving
in place only the files that were shipped with the base OS image itself.

System configuration images are similar but operate on directories
containing system or service configuration.

On nearly all modern distributions mount propagation plays a crucial
role and the rootfs of the OS is a shared mount in a peer group (usually
with peer group id 1):

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/ / ext4 shared:1 29 1

On such systems all services and containers run in a separate mount
namespace and are pivot_root()ed into their rootfs. A separate mount
namespace is almost always used as it is the minimal isolation mechanism
services have. But usually they are even much more isolated up to the
point where they almost become indistinguishable from containers.

Mount propagation again plays a crucial role here. The rootfs of all
these services is a slave mount to the peer group of the host rootfs.
This is done so the service will receive mount propagation events from
the host when certain files or directories are updated.

In addition, the rootfs of each service, container, and sandbox is also
a shared mount in its separate peer group:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/ / ext4 shared:24 master:1 71 47

For people not too familiar with mount propagation, the master:1 means
that this is a slave mount to peer group 1. Which as one can see is the
host rootfs as indicated by shared:1 above. The shared:24 indicates that
the service rootfs is a shared mount in a separate peer group with peer
group id 24.

A service may run other services. Such nested services will also have a
rootfs mount that is a slave to the peer group of the outer service
rootfs mount.

For containers things are just slighly different. A container's rootfs
isn't a slave to the service's or host rootfs' peer group. The rootfs
mount of a container is simply a shared mount in its own peer group:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/home/ubuntu/debian-tree / ext4 shared:99 61 60

So whereas services are isolated OS components a container is treated
like a separate world and mount propagation into it is restricted to a
single well known mount that is a slave to the peer group of the shared
mount /run on the host:

TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/propagate/debian-tree /run/host/incoming tmpfs master:5 71 68

Here, the master:5 indicates that this mount is a slave to the peer
group with peer group id 5. This allows to propagate mounts into the
container and served as a workaround for not being able to insert mounts
into mount namespaces directly. But the new mount api does support
inserting mounts directly. For the interested reader the blogpost in [2]
might be worth reading where I explain the old and the new approach to
inserting mounts into mount namespaces.

Containers of course, can themselves be run as services. They often run
full systems themselves which means they again run services and
containers with the exact same propagation settings explained above.

The whole system is designed so that it can be easily updated, including
all services in various fine-grained ways without having to enter every
single service's mount namespace which would be prohibitively expensive.
The mount propagation layout has been carefully chosen so it is possible
to propagate updates for system extensions and configurations from the
host into all services.

The simplest model to update the whole system is to mount on top of
/usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
will then propagate into every service. This works cleanly the first
time. However, when the system is updated multiple times it becomes
necessary to unmount the first update on /opt, /usr, /etc and then
propagate the new update. But this means, there's an interval where the
old base system is accessible. This has to be avoided to protect against
downgrade attacks.

The vfs already exposes a mechanism to userspace whereby mounts can be
mounted beneath an existing mount. Such mounts are internally referred
to as "tucked". The patch series exposes the ability to mount beneath a
top mount through the new MOVE_MOUNT_BENEATH flag for the move_mount()
system call. This allows userspace to seamlessly upgrade mounts. After
this series the only thing that will have changed is that mounting
beneath an existing mount can be done explicitly instead of just
implicitly.

Today, there are two scenarios where a mount can be mounted beneath an
existing mount instead of on top of it:

(1) When a service or container is started in a new mount namespace and
pivot_root()s into its new rootfs. The way this is done is by
mounting the new rootfs beneath the old rootfs:

fd_newroot = open("/var/lib/machines/fedora", ...);
fd_oldroot = open("/", ...);
fchdir(fd_newroot);
pivot_root(".", ".");

After the pivot_root(".", ".") call the new rootfs is mounted
beneath the old rootfs which can then be unmounted to reveal the
underlying mount:

fchdir(fd_oldroot);
umount2(".", MNT_DETACH);

Since pivot_root() moves the caller into a new rootfs no mounts must
be propagated out of the new rootfs as a consequence of the
pivot_root() call. Thus, the mounts cannot be shared.

(2) When a mount is propagated to a mount that already has another mount
mounted on the same dentry.

The easiest example for this is to create a new mount namespace. The
following commands will create a mount namespace where the rootfs
mount / will be a slave to the peer group of the host rootfs /
mount's peer group. IOW, it will receive propagation from the host:

mount --make-shared /
unshare --mount --propagation=slave

Now a new mount on the /mnt dentry in that mount namespace is
created. (As it can be confusing it should be spelled out that the
tmpfs mount on the /mnt dentry that was just created doesn't
propagate back to the host because the rootfs mount / of the mount
namespace isn't a peer of the host rootfs.):

mount -t tmpfs tmpfs /mnt

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt tmpfs tmpfs

Now another terminal in the host mount namespace can observe that
the mount indeed hasn't propagated back to into the host mount
namespace. A new mount can now be created on top of the /mnt dentry
with the rootfs mount / as its parent:

mount --bind /opt /mnt

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt /dev/sda2[/opt] ext4 shared:1

The mount namespace that was created earlier can now observe that
the bind mount created on the host has propagated into it:

TARGET SOURCE FSTYPE PROPAGATION
└─/mnt /dev/sda2[/opt] ext4 master:1
└─/mnt tmpfs tmpfs

But instead of having been mounted on top of the tmpfs mount at the
/mnt dentry the /opt mount has been mounted on top of the rootfs
mount at the /mnt dentry. And the tmpfs mount has been remounted on
top of the propagated /opt mount at the /opt dentry. So in other
words, the propagated mount has been mounted beneath the preexisting
mount in that mount namespace.

Mount namespaces make this easy to illustrate but it's also easy to
mount beneath an existing mount in the same mount namespace
(The following example assumes a shared rootfs mount / with peer
group id 1):

mount --bind /opt /opt

TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION
└─/opt /dev/sda2[/opt] ext4 188 29 shared:1

If another mount is mounted on top of the /opt mount at the /opt
dentry:

mount --bind /tmp /opt

The following clunky mount tree will result:

TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION
└─/opt /dev/sda2[/tmp] ext4 405 29 shared:1
└─/opt /dev/sda2[/opt] ext4 188 405 shared:1
└─/opt /dev/sda2[/tmp] ext4 404 188 shared:1

The /tmp mount is mounted beneath the /opt mount and another copy is
mounted on top of the /opt mount. This happens because the rootfs /
and the /opt mount are shared mounts in the same peer group.

When the new /tmp mount is supposed to be mounted at the /opt dentry
then the /tmp mount first propagates to the root mount at the /opt
dentry. But there already is the /opt mount mounted at the /opt
dentry. So the old /opt mount at the /opt dentry will be mounted on
top of the new /tmp mount at the /tmp dentry, i.e. @opt->mnt_parent
is @tmp and @opt->mnt_mountpoint is /tmp (Note that @opt->mnt_root
is /opt which is what shows up as /opt under SOURCE). So again, a
mount will be mounted beneath a preexisting mount.

(Fwiw, a few iterations of mount --bind /opt /opt in a loop on a
shared rootfs is a good example of what could be referred to as
mount explosion.)

The main point is that such mounts allows userspace to umount a top
mount and reveal an underlying mount. So for example, umounting the
tmpfs mount on /mnt that was created in example (1) using mount
namespaces reveals the /opt mount which was mounted beneath it.

In (2) where a mount was mounted beneath the top mount in the same mount
namespace unmounting the top mount would unmount both the top mount and
the mount beneath. In the process the original mount would be remounted
on top of the rootfs mount / at the /opt dentry again.

This again, is a result of mount propagation only this time it's umount
propagation. However, this can be avoided by simply making the parent
mount / of the @opt mount a private or slave mount. Then the top mount
and the original mount can be unmounted to reveal the mount beneath.

These two examples are fairly arcane and are merely added to make it
clear how mount propagation has effects on current and future features.

More common use-cases will just be things like:

mount -t btrfs /dev/sdA /mnt
mount -t xfs /dev/sdB --beneath /mnt
umount /mnt

after which we'll have updated from a btrfs filesystem to a xfs
filesystem without ever revealing the underlying mountpoint.

The crux is that the proposed mechanism already exists and that it is so
powerful as to cover cases where mounts are supposed to be updated with
new versions. Crucially, it offers an important flexibility. Namely that
updates to a system may either be forced or can be delayed and the
umount of the top mount be left to a service if it is a cooperative one.

This adds a new flag to move_mount() that allows to explicitly move a
beneath the top mount adhering to the following semantics:

* Mounts cannot be mounted beneath the rootfs. This restriction
encompasses the rootfs but also chroots via chroot() and pivot_root().
To mount a mount beneath the rootfs or a chroot, pivot_root() can be
used as illustrated above.
* The source mount must be a private mount to force the kernel to
allocate a new, unused peer group id. This isn't a required
restriction but a voluntary one. It avoids repeating a semantical
quirk that already exists today. If bind mounts which already have a
peer group id are inserted into mount trees that have the same peer
group id this can cause a lot of mount propagation events to be
generated (For example, consider running mount --bind /opt /opt in a
loop where the parent mount is a shared mount.).
* Avoid getting rid of the top mount in the kernel. Cooperative services
need to be able to unmount the top mount themselves.
This also avoids a good deal of additional complexity. The umount
would have to be propagated which would be another rather expensive
operation. So namespace_lock() and lock_mount_hash() would potentially
have to be held for a long time for both a mount and umount
propagation. That should be avoided.
* The path to mount beneath must be mounted and attached.
* The top mount and its parent must be in the caller's mount namespace
and the caller must be able to mount in that mount namespace.
* The caller must be able to unmount the top mount to prove that they
could reveal the underlying mount.
* The propagation tree is calculated based on the destination mount's
parent mount and the destination mount's mountpoint on the parent
mount. Of course, if the parent of the destination mount and the
destination mount are shared mounts in the same peer group and the
mountpoint of the new mount to be mounted is a subdir of their
->mnt_root then both will receive a mount of /opt. That's probably
easier to understand with an example. Assuming a standard shared
rootfs /:

mount --bind /opt /opt
mount --bind /tmp /opt

will cause the same mount tree as:

mount --bind /opt /opt
mount --beneath /tmp /opt

because both / and /opt are shared mounts/peers in the same peer
group and the /opt dentry is a subdirectory of both the parent's and
the child's ->mnt_root. If a mount tree like that is created it almost
always is an accident or abuse of mount propagation. Realistically
what most people probably mean in this scenarios is:

mount --bind /opt /opt
mount --make-private /opt
mount --make-shared /opt

This forces the allocation of a new separate peer group for the /opt
mount. Aferwards a mount --bind or mount --beneath actually makes
sense as the / and /opt mount belong to different peer groups. Before
that it's likely just confusion about what the user wanted to achieve.
* Refuse MOVE_MOUNT_BENEATH if:
(1) the @mnt_from has been overmounted in between path resolution and
acquiring @namespace_sem when locking @mnt_to. This avoids the
proliferation of shadow mounts.
(2) if @to_mnt is moved to a different mountpoint while acquiring
@namespace_sem to lock @to_mnt.
(3) if @to_mnt is unmounted while acquiring @namespace_sem to lock
@to_mnt.
(4) if the parent of the target mount propagates to the target mount
at the same mountpoint.
This would mean mounting @mnt_from on @mnt_to->mnt_parent and then
propagating a copy @c of @mnt_from onto @mnt_to. This defeats the
whole purpose of mounting @mnt_from beneath @mnt_to.
(5) if the parent mount @mnt_to->mnt_parent propagates to @mnt_from at
the same mountpoint.
If @mnt_to->mnt_parent propagates to @mnt_from this would mean
propagating a copy @c of @mnt_from on top of @mnt_from. Afterwards
@mnt_from would be mounted on top of @mnt_to->mnt_parent and
@mnt_to would be unmounted from @mnt->mnt_parent and remounted on
@mnt_from. But since @c is already mounted on @mnt_from, @mnt_to
would ultimately be remounted on top of @c. Afterwards, @mnt_from
would be covered by a copy @c of @mnt_from and @c would be covered
by @mnt_from itself. This defeats the whole purpose of mounting
@mnt_from beneath @mnt_to.
Cases (1) to (3) are required as they deal with races that would cause
bugs or unexpected behavior for users. Cases (4) and (5) refuse
semantical quirks that would not be a bug but would cause weird mount
trees to be created. While they can already be created via other means
(mount --bind /opt /opt x n) there's no reason to repeat past mistakes
in new features.

Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [1]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [2]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013

Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Message-Id: <20230202-fs-move-mount-replace-v4-4-98f3d80d7eaa@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff a26f788b Thu Feb 03 06:14:08 MST 2022 Christian Brauner <brauner@kernel.org> fs: add mnt_allow_writers() and simplify mount_setattr_prepare()

Add a tiny helper that lets us simplify the control-flow and can be used
in the next patch to avoid adding another condition open-coded into
mount_setattr_prepare(). Instead we can add it into the new helper.

Link: https://lore.kernel.org/r/20220203131411.3093040-5-brauner@kernel.org
Cc: Seth Forshee <seth.forshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 79f6540b Thu Sep 02 15:55:10 MDT 2021 Vasily Averin <vvs@virtuozzo.com> memcg: enable accounting for mnt_cache entries

Patch series "memcg accounting from OpenVZ", v7.

OpenVZ uses memory accounting 20+ years since v2.2.x linux kernels.
Initially we used our own accounting subsystem, then partially committed
it to upstream, and a few years ago switched to cgroups v1. Now we're
rebasing again, revising our old patches and trying to push them upstream.

We try to protect the host system from any misuse of kernel memory
allocation triggered by untrusted users inside the containers.

Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing
list, though I would be very grateful for any comments from maintainersi
of affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and its Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and namespace objects itself
- IPC objects: semaphores, message queues and share memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file lock
- fasync_struct used by the file lease code and driver's fasync queues
- tty objects
- per-mm LDT

We have an incorrect/incomplete/obsoleted accounting for few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and probably will be dropped at all.

Also we're going to add an accounting for nft, however it is not ready
yet.

We have not tested performance on upstream, however, our performance team
compares our current RHEL7-based production kernel and reports that they
are at least not worse as the according original RHEL7 kernel.

This patch (of 10):

The kernel allocates ~400 bytes of 'struct mount' for any new mount.
Creating a new mount namespace clones most of the parent mounts, and this
can be repeated many times. Additionally, each mount allocates up to
PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 5b490500 Thu Jan 21 06:19:52 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> fs: add attr_flags_to_mnt_flags helper

Add a simple helper to translate uapi MOUNT_ATTR_* flags to MNT_* flags
which we will use in follow-up patches too.

Link: https://lore.kernel.org/r/20210121131959.646623-34-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Completed in 548 milliseconds