Searched +hist:1 +hist:b04fa99 (Results 1 - 3 of 3) sorted by relevance

/linux-master/kernel/rcu/
H A Dtasks.hdiff 1612160b Fri Feb 02 12:49:06 MST 2024 Paul E. McKenney <paulmck@kernel.org> rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks

Holding a mutex across synchronize_rcu_tasks() and acquiring
that same mutex in code called from do_exit() after its call to
exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop()
results in deadlock. This is by design, because tasks that are far
enough into do_exit() are no longer present on the tasks list, making
it a bit difficult for RCU Tasks to find them, let alone wait on them
to do a voluntary context switch. However, such deadlocks are becoming
more frequent. In addition, lockdep currently does not detect such
deadlocks and they can be difficult to reproduce.

In addition, if a task voluntarily context switches during that time
(for example, if it blocks acquiring a mutex), then this task is in an
RCU Tasks quiescent state. And with some adjustments, RCU Tasks could
just as well take advantage of that fact.

This commit therefore eliminates these deadlock by replacing the
SRCU-based wait for do_exit() completion with per-CPU lists of tasks
currently exiting. A given task will be on one of these per-CPU lists for
the same period of time that this task would previously have been in the
previous SRCU read-side critical section. These lists enable RCU Tasks
to find the tasks that have already been removed from the tasks list,
but that must nevertheless be waited upon.

The RCU Tasks grace period gathers any of these do_exit() tasks that it
must wait on, and adds them to the list of holdouts. Per-CPU locking
and get_task_struct() are used to synchronize addition to and removal
from these lists.

Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/

Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
diff 6b70399f Fri Feb 02 12:28:45 MST 2024 Paul E. McKenney <paulmck@kernel.org> rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocks

This commit continues the elimination of deadlocks involving do_exit()
and RCU tasks by causing exit_tasks_rcu_start() to add the current
task to a per-CPU list and causing exit_tasks_rcu_stop() to remove the
current task from whatever list it is on. These lists will be used to
track tasks that are exiting, while still accounting for any RCU-tasks
quiescent states that these tasks pass though.

[ paulmck: Apply Frederic Weisbecker feedback. ]

Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/

Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
diff 46faf9d8 Mon Feb 05 14:10:19 MST 2024 Paul E. McKenney <paulmck@kernel.org> rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocks

Holding a mutex across synchronize_rcu_tasks() and acquiring
that same mutex in code called from do_exit() after its call to
exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop()
results in deadlock. This is by design, because tasks that are far
enough into do_exit() are no longer present on the tasks list, making
it a bit difficult for RCU Tasks to find them, let alone wait on them
to do a voluntary context switch. However, such deadlocks are becoming
more frequent. In addition, lockdep currently does not detect such
deadlocks and they can be difficult to reproduce.

In addition, if a task voluntarily context switches during that time
(for example, if it blocks acquiring a mutex), then this task is in an
RCU Tasks quiescent state. And with some adjustments, RCU Tasks could
just as well take advantage of that fact.

This commit therefore initializes the data structures that will be needed
to rely on these quiescent states and to eliminate these deadlocks.

Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/

Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
diff bfe93930 Mon Feb 05 14:08:22 MST 2024 Paul E. McKenney <paulmck@kernel.org> rcu-tasks: Add data to eliminate RCU-tasks/do_exit() deadlocks

Holding a mutex across synchronize_rcu_tasks() and acquiring
that same mutex in code called from do_exit() after its call to
exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop()
results in deadlock. This is by design, because tasks that are far
enough into do_exit() are no longer present on the tasks list, making
it a bit difficult for RCU Tasks to find them, let alone wait on them
to do a voluntary context switch. However, such deadlocks are becoming
more frequent. In addition, lockdep currently does not detect such
deadlocks and they can be difficult to reproduce.

In addition, if a task voluntarily context switches during that time
(for example, if it blocks acquiring a mutex), then this task is in an
RCU Tasks quiescent state. And with some adjustments, RCU Tasks could
just as well take advantage of that fact.

This commit therefore adds the data structures that will be needed
to rely on these quiescent states and to eliminate these deadlocks.

Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/

Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
diff 9715ed50 Fri Oct 27 08:40:48 MDT 2023 Frederic Weisbecker <frederic@kernel.org> rcu/tasks: Handle new PF_IDLE semantics

The commit:

cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")

has changed the semantics of what is to be considered an idle task in
such a way that CPU boot code preceding the actual idle loop is excluded
from it.

This has however introduced new potential RCU-tasks stalls when either:

1) Grace period is started before init/0 had a chance to set PF_IDLE,
keeping it stuck in the holdout list until idle ever schedules.

2) Grace period is started when some possible CPUs have never been
online, keeping their idle tasks stuck in the holdout list until the
CPU ever boots up.

3) Similar to 1) but with secondary CPUs: Grace period is started
concurrently with secondary CPU booting, putting its idle task in
the holdout list because PF_IDLE isn't yet observed on it. It stays
then stuck in the holdout list until that CPU ever schedules. The
effect is mitigated here by the hotplug AP thread that must run to
bring the CPU up.

Fix this with handling the new semantics of PF_IDLE, keeping in mind
that it may or may not be set on an idle task. Take advantage of that to
strengthen the coverage of an RCU-tasks quiescent state within an idle
task, excluding the CPU boot code from it. Only the code running within
the idle loop is now a quiescent state, along with offline CPUs.

Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Suggested-by: Paul E . McKenney" <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
diff 9715ed50 Fri Oct 27 08:40:48 MDT 2023 Frederic Weisbecker <frederic@kernel.org> rcu/tasks: Handle new PF_IDLE semantics

The commit:

cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")

has changed the semantics of what is to be considered an idle task in
such a way that CPU boot code preceding the actual idle loop is excluded
from it.

This has however introduced new potential RCU-tasks stalls when either:

1) Grace period is started before init/0 had a chance to set PF_IDLE,
keeping it stuck in the holdout list until idle ever schedules.

2) Grace period is started when some possible CPUs have never been
online, keeping their idle tasks stuck in the holdout list until the
CPU ever boots up.

3) Similar to 1) but with secondary CPUs: Grace period is started
concurrently with secondary CPU booting, putting its idle task in
the holdout list because PF_IDLE isn't yet observed on it. It stays
then stuck in the holdout list until that CPU ever schedules. The
effect is mitigated here by the hotplug AP thread that must run to
bring the CPU up.

Fix this with handling the new semantics of PF_IDLE, keeping in mind
that it may or may not be set on an idle task. Take advantage of that to
strengthen the coverage of an RCU-tasks quiescent state within an idle
task, excluding the CPU boot code from it. Only the code running within
the idle loop is now a quiescent state, along with offline CPUs.

Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Suggested-by: Paul E . McKenney" <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
diff db13710a Mon Jun 26 14:45:13 MDT 2023 Paul E. McKenney <paulmck@kernel.org> rcu-tasks: Cancel callback laziness if too many callbacks

The various RCU Tasks flavors now do lazy grace periods when there are
only asynchronous grace period requests. By default, the system will let
250 milliseconds elapse after the first call_rcu_tasks*() callbacki is
queued before starting a grace period. In contrast, synchronous grace
period requests such as synchronize_rcu_tasks*() will start a grace
period immediately.

However, invoking one of the call_rcu_tasks*() functions in a too-tight
loop can result in a callback flood, which in turn can exhaust memory
if grace periods are delayed for too long.

This commit therefore sets a limit so that the grace-period kthread
will be awakened when any CPU's callback list expands to contain
rcupdate.rcu_task_lazy_lim callbacks elements (defaulting to 32, set to -1
to disable), the grace-period kthread will be awakened, thus cancelling
any ongoing laziness and getting out in front of the potential callback
flood.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff 5fc8cbe4 Tue Aug 02 10:22:05 MDT 2022 Shigeru Yoshida <syoshida@redhat.com> rcu-tasks: Avoid pr_info() with spin lock in cblist_init_generic()

pr_info() is called with rtp->cbs_gbl_lock spin lock locked. Because
pr_info() calls printk() that might sleep, this will result in BUG
like below:

[ 0.206455] cblist_init_generic: Setting adjustable number of callback queues.
[ 0.206463]
[ 0.206464] =============================
[ 0.206464] [ BUG: Invalid wait context ]
[ 0.206465] 5.19.0-00428-g9de1f9c8ca51 #5 Not tainted
[ 0.206466] -----------------------------
[ 0.206466] swapper/0/1 is trying to lock:
[ 0.206467] ffffffffa0167a58 (&port_lock_key){....}-{3:3}, at: serial8250_console_write+0x327/0x4a0
[ 0.206473] other info that might help us debug this:
[ 0.206473] context-{5:5}
[ 0.206474] 3 locks held by swapper/0/1:
[ 0.206474] #0: ffffffff9eb597e0 (rcu_tasks.cbs_gbl_lock){....}-{2:2}, at: cblist_init_generic.constprop.0+0x14/0x1f0
[ 0.206478] #1: ffffffff9eb579c0 (console_lock){+.+.}-{0:0}, at: _printk+0x63/0x7e
[ 0.206482] #2: ffffffff9ea77780 (console_owner){....}-{0:0}, at: console_emit_next_record.constprop.0+0x111/0x330
[ 0.206485] stack backtrace:
[ 0.206486] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.19.0-00428-g9de1f9c8ca51 #5
[ 0.206488] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
[ 0.206489] Call Trace:
[ 0.206490] <TASK>
[ 0.206491] dump_stack_lvl+0x6a/0x9f
[ 0.206493] __lock_acquire.cold+0x2d7/0x2fe
[ 0.206496] ? stack_trace_save+0x46/0x70
[ 0.206497] lock_acquire+0xd1/0x2f0
[ 0.206499] ? serial8250_console_write+0x327/0x4a0
[ 0.206500] ? __lock_acquire+0x5c7/0x2720
[ 0.206502] _raw_spin_lock_irqsave+0x3d/0x90
[ 0.206504] ? serial8250_console_write+0x327/0x4a0
[ 0.206506] serial8250_console_write+0x327/0x4a0
[ 0.206508] console_emit_next_record.constprop.0+0x180/0x330
[ 0.206511] console_unlock+0xf7/0x1f0
[ 0.206512] vprintk_emit+0xf7/0x330
[ 0.206514] _printk+0x63/0x7e
[ 0.206516] cblist_init_generic.constprop.0.cold+0x24/0x32
[ 0.206518] rcu_init_tasks_generic+0x5/0xd9
[ 0.206522] kernel_init_freeable+0x15b/0x2a2
[ 0.206523] ? rest_init+0x160/0x160
[ 0.206526] kernel_init+0x11/0x120
[ 0.206527] ret_from_fork+0x1f/0x30
[ 0.206530] </TASK>
[ 0.207018] cblist_init_generic: Setting shift to 1 and lim to 1.

This patch moves pr_info() so that it is called without
rtp->cbs_gbl_lock locked.

Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff 5fc8cbe4 Tue Aug 02 10:22:05 MDT 2022 Shigeru Yoshida <syoshida@redhat.com> rcu-tasks: Avoid pr_info() with spin lock in cblist_init_generic()

pr_info() is called with rtp->cbs_gbl_lock spin lock locked. Because
pr_info() calls printk() that might sleep, this will result in BUG
like below:

[ 0.206455] cblist_init_generic: Setting adjustable number of callback queues.
[ 0.206463]
[ 0.206464] =============================
[ 0.206464] [ BUG: Invalid wait context ]
[ 0.206465] 5.19.0-00428-g9de1f9c8ca51 #5 Not tainted
[ 0.206466] -----------------------------
[ 0.206466] swapper/0/1 is trying to lock:
[ 0.206467] ffffffffa0167a58 (&port_lock_key){....}-{3:3}, at: serial8250_console_write+0x327/0x4a0
[ 0.206473] other info that might help us debug this:
[ 0.206473] context-{5:5}
[ 0.206474] 3 locks held by swapper/0/1:
[ 0.206474] #0: ffffffff9eb597e0 (rcu_tasks.cbs_gbl_lock){....}-{2:2}, at: cblist_init_generic.constprop.0+0x14/0x1f0
[ 0.206478] #1: ffffffff9eb579c0 (console_lock){+.+.}-{0:0}, at: _printk+0x63/0x7e
[ 0.206482] #2: ffffffff9ea77780 (console_owner){....}-{0:0}, at: console_emit_next_record.constprop.0+0x111/0x330
[ 0.206485] stack backtrace:
[ 0.206486] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.19.0-00428-g9de1f9c8ca51 #5
[ 0.206488] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
[ 0.206489] Call Trace:
[ 0.206490] <TASK>
[ 0.206491] dump_stack_lvl+0x6a/0x9f
[ 0.206493] __lock_acquire.cold+0x2d7/0x2fe
[ 0.206496] ? stack_trace_save+0x46/0x70
[ 0.206497] lock_acquire+0xd1/0x2f0
[ 0.206499] ? serial8250_console_write+0x327/0x4a0
[ 0.206500] ? __lock_acquire+0x5c7/0x2720
[ 0.206502] _raw_spin_lock_irqsave+0x3d/0x90
[ 0.206504] ? serial8250_console_write+0x327/0x4a0
[ 0.206506] serial8250_console_write+0x327/0x4a0
[ 0.206508] console_emit_next_record.constprop.0+0x180/0x330
[ 0.206511] console_unlock+0xf7/0x1f0
[ 0.206512] vprintk_emit+0xf7/0x330
[ 0.206514] _printk+0x63/0x7e
[ 0.206516] cblist_init_generic.constprop.0.cold+0x24/0x32
[ 0.206518] rcu_init_tasks_generic+0x5/0xd9
[ 0.206522] kernel_init_freeable+0x15b/0x2a2
[ 0.206523] ? rest_init+0x160/0x160
[ 0.206526] kernel_init+0x11/0x120
[ 0.206527] ret_from_fork+0x1f/0x30
[ 0.206530] </TASK>
[ 0.207018] cblist_init_generic: Setting shift to 1 and lim to 1.

This patch moves pr_info() so that it is called without
rtp->cbs_gbl_lock locked.

Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff 5fc8cbe4 Tue Aug 02 10:22:05 MDT 2022 Shigeru Yoshida <syoshida@redhat.com> rcu-tasks: Avoid pr_info() with spin lock in cblist_init_generic()

pr_info() is called with rtp->cbs_gbl_lock spin lock locked. Because
pr_info() calls printk() that might sleep, this will result in BUG
like below:

[ 0.206455] cblist_init_generic: Setting adjustable number of callback queues.
[ 0.206463]
[ 0.206464] =============================
[ 0.206464] [ BUG: Invalid wait context ]
[ 0.206465] 5.19.0-00428-g9de1f9c8ca51 #5 Not tainted
[ 0.206466] -----------------------------
[ 0.206466] swapper/0/1 is trying to lock:
[ 0.206467] ffffffffa0167a58 (&port_lock_key){....}-{3:3}, at: serial8250_console_write+0x327/0x4a0
[ 0.206473] other info that might help us debug this:
[ 0.206473] context-{5:5}
[ 0.206474] 3 locks held by swapper/0/1:
[ 0.206474] #0: ffffffff9eb597e0 (rcu_tasks.cbs_gbl_lock){....}-{2:2}, at: cblist_init_generic.constprop.0+0x14/0x1f0
[ 0.206478] #1: ffffffff9eb579c0 (console_lock){+.+.}-{0:0}, at: _printk+0x63/0x7e
[ 0.206482] #2: ffffffff9ea77780 (console_owner){....}-{0:0}, at: console_emit_next_record.constprop.0+0x111/0x330
[ 0.206485] stack backtrace:
[ 0.206486] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.19.0-00428-g9de1f9c8ca51 #5
[ 0.206488] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
[ 0.206489] Call Trace:
[ 0.206490] <TASK>
[ 0.206491] dump_stack_lvl+0x6a/0x9f
[ 0.206493] __lock_acquire.cold+0x2d7/0x2fe
[ 0.206496] ? stack_trace_save+0x46/0x70
[ 0.206497] lock_acquire+0xd1/0x2f0
[ 0.206499] ? serial8250_console_write+0x327/0x4a0
[ 0.206500] ? __lock_acquire+0x5c7/0x2720
[ 0.206502] _raw_spin_lock_irqsave+0x3d/0x90
[ 0.206504] ? serial8250_console_write+0x327/0x4a0
[ 0.206506] serial8250_console_write+0x327/0x4a0
[ 0.206508] console_emit_next_record.constprop.0+0x180/0x330
[ 0.206511] console_unlock+0xf7/0x1f0
[ 0.206512] vprintk_emit+0xf7/0x330
[ 0.206514] _printk+0x63/0x7e
[ 0.206516] cblist_init_generic.constprop.0.cold+0x24/0x32
[ 0.206518] rcu_init_tasks_generic+0x5/0xd9
[ 0.206522] kernel_init_freeable+0x15b/0x2a2
[ 0.206523] ? rest_init+0x160/0x160
[ 0.206526] kernel_init+0x11/0x120
[ 0.206527] ret_from_fork+0x1f/0x30
[ 0.206530] </TASK>
[ 0.207018] cblist_init_generic: Setting shift to 1 and lim to 1.

This patch moves pr_info() so that it is called without
rtp->cbs_gbl_lock locked.

Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff 5fc8cbe4 Tue Aug 02 10:22:05 MDT 2022 Shigeru Yoshida <syoshida@redhat.com> rcu-tasks: Avoid pr_info() with spin lock in cblist_init_generic()

pr_info() is called with rtp->cbs_gbl_lock spin lock locked. Because
pr_info() calls printk() that might sleep, this will result in BUG
like below:

[ 0.206455] cblist_init_generic: Setting adjustable number of callback queues.
[ 0.206463]
[ 0.206464] =============================
[ 0.206464] [ BUG: Invalid wait context ]
[ 0.206465] 5.19.0-00428-g9de1f9c8ca51 #5 Not tainted
[ 0.206466] -----------------------------
[ 0.206466] swapper/0/1 is trying to lock:
[ 0.206467] ffffffffa0167a58 (&port_lock_key){....}-{3:3}, at: serial8250_console_write+0x327/0x4a0
[ 0.206473] other info that might help us debug this:
[ 0.206473] context-{5:5}
[ 0.206474] 3 locks held by swapper/0/1:
[ 0.206474] #0: ffffffff9eb597e0 (rcu_tasks.cbs_gbl_lock){....}-{2:2}, at: cblist_init_generic.constprop.0+0x14/0x1f0
[ 0.206478] #1: ffffffff9eb579c0 (console_lock){+.+.}-{0:0}, at: _printk+0x63/0x7e
[ 0.206482] #2: ffffffff9ea77780 (console_owner){....}-{0:0}, at: console_emit_next_record.constprop.0+0x111/0x330
[ 0.206485] stack backtrace:
[ 0.206486] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.19.0-00428-g9de1f9c8ca51 #5
[ 0.206488] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
[ 0.206489] Call Trace:
[ 0.206490] <TASK>
[ 0.206491] dump_stack_lvl+0x6a/0x9f
[ 0.206493] __lock_acquire.cold+0x2d7/0x2fe
[ 0.206496] ? stack_trace_save+0x46/0x70
[ 0.206497] lock_acquire+0xd1/0x2f0
[ 0.206499] ? serial8250_console_write+0x327/0x4a0
[ 0.206500] ? __lock_acquire+0x5c7/0x2720
[ 0.206502] _raw_spin_lock_irqsave+0x3d/0x90
[ 0.206504] ? serial8250_console_write+0x327/0x4a0
[ 0.206506] serial8250_console_write+0x327/0x4a0
[ 0.206508] console_emit_next_record.constprop.0+0x180/0x330
[ 0.206511] console_unlock+0xf7/0x1f0
[ 0.206512] vprintk_emit+0xf7/0x330
[ 0.206514] _printk+0x63/0x7e
[ 0.206516] cblist_init_generic.constprop.0.cold+0x24/0x32
[ 0.206518] rcu_init_tasks_generic+0x5/0xd9
[ 0.206522] kernel_init_freeable+0x15b/0x2a2
[ 0.206523] ? rest_init+0x160/0x160
[ 0.206526] kernel_init+0x11/0x120
[ 0.206527] ret_from_fork+0x1f/0x30
[ 0.206530] </TASK>
[ 0.207018] cblist_init_generic: Setting shift to 1 and lim to 1.

This patch moves pr_info() so that it is called without
rtp->cbs_gbl_lock locked.

Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff 5fc8cbe4 Tue Aug 02 10:22:05 MDT 2022 Shigeru Yoshida <syoshida@redhat.com> rcu-tasks: Avoid pr_info() with spin lock in cblist_init_generic()

pr_info() is called with rtp->cbs_gbl_lock spin lock locked. Because
pr_info() calls printk() that might sleep, this will result in BUG
like below:

[ 0.206455] cblist_init_generic: Setting adjustable number of callback queues.
[ 0.206463]
[ 0.206464] =============================
[ 0.206464] [ BUG: Invalid wait context ]
[ 0.206465] 5.19.0-00428-g9de1f9c8ca51 #5 Not tainted
[ 0.206466] -----------------------------
[ 0.206466] swapper/0/1 is trying to lock:
[ 0.206467] ffffffffa0167a58 (&port_lock_key){....}-{3:3}, at: serial8250_console_write+0x327/0x4a0
[ 0.206473] other info that might help us debug this:
[ 0.206473] context-{5:5}
[ 0.206474] 3 locks held by swapper/0/1:
[ 0.206474] #0: ffffffff9eb597e0 (rcu_tasks.cbs_gbl_lock){....}-{2:2}, at: cblist_init_generic.constprop.0+0x14/0x1f0
[ 0.206478] #1: ffffffff9eb579c0 (console_lock){+.+.}-{0:0}, at: _printk+0x63/0x7e
[ 0.206482] #2: ffffffff9ea77780 (console_owner){....}-{0:0}, at: console_emit_next_record.constprop.0+0x111/0x330
[ 0.206485] stack backtrace:
[ 0.206486] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.19.0-00428-g9de1f9c8ca51 #5
[ 0.206488] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
[ 0.206489] Call Trace:
[ 0.206490] <TASK>
[ 0.206491] dump_stack_lvl+0x6a/0x9f
[ 0.206493] __lock_acquire.cold+0x2d7/0x2fe
[ 0.206496] ? stack_trace_save+0x46/0x70
[ 0.206497] lock_acquire+0xd1/0x2f0
[ 0.206499] ? serial8250_console_write+0x327/0x4a0
[ 0.206500] ? __lock_acquire+0x5c7/0x2720
[ 0.206502] _raw_spin_lock_irqsave+0x3d/0x90
[ 0.206504] ? serial8250_console_write+0x327/0x4a0
[ 0.206506] serial8250_console_write+0x327/0x4a0
[ 0.206508] console_emit_next_record.constprop.0+0x180/0x330
[ 0.206511] console_unlock+0xf7/0x1f0
[ 0.206512] vprintk_emit+0xf7/0x330
[ 0.206514] _printk+0x63/0x7e
[ 0.206516] cblist_init_generic.constprop.0.cold+0x24/0x32
[ 0.206518] rcu_init_tasks_generic+0x5/0xd9
[ 0.206522] kernel_init_freeable+0x15b/0x2a2
[ 0.206523] ? rest_init+0x160/0x160
[ 0.206526] kernel_init+0x11/0x120
[ 0.206527] ret_from_fork+0x1f/0x30
[ 0.206530] </TASK>
[ 0.207018] cblist_init_generic: Setting shift to 1 and lim to 1.

This patch moves pr_info() so that it is called without
rtp->cbs_gbl_lock locked.

Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff 5fc8cbe4 Tue Aug 02 10:22:05 MDT 2022 Shigeru Yoshida <syoshida@redhat.com> rcu-tasks: Avoid pr_info() with spin lock in cblist_init_generic()

pr_info() is called with rtp->cbs_gbl_lock spin lock locked. Because
pr_info() calls printk() that might sleep, this will result in BUG
like below:

[ 0.206455] cblist_init_generic: Setting adjustable number of callback queues.
[ 0.206463]
[ 0.206464] =============================
[ 0.206464] [ BUG: Invalid wait context ]
[ 0.206465] 5.19.0-00428-g9de1f9c8ca51 #5 Not tainted
[ 0.206466] -----------------------------
[ 0.206466] swapper/0/1 is trying to lock:
[ 0.206467] ffffffffa0167a58 (&port_lock_key){....}-{3:3}, at: serial8250_console_write+0x327/0x4a0
[ 0.206473] other info that might help us debug this:
[ 0.206473] context-{5:5}
[ 0.206474] 3 locks held by swapper/0/1:
[ 0.206474] #0: ffffffff9eb597e0 (rcu_tasks.cbs_gbl_lock){....}-{2:2}, at: cblist_init_generic.constprop.0+0x14/0x1f0
[ 0.206478] #1: ffffffff9eb579c0 (console_lock){+.+.}-{0:0}, at: _printk+0x63/0x7e
[ 0.206482] #2: ffffffff9ea77780 (console_owner){....}-{0:0}, at: console_emit_next_record.constprop.0+0x111/0x330
[ 0.206485] stack backtrace:
[ 0.206486] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.19.0-00428-g9de1f9c8ca51 #5
[ 0.206488] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
[ 0.206489] Call Trace:
[ 0.206490] <TASK>
[ 0.206491] dump_stack_lvl+0x6a/0x9f
[ 0.206493] __lock_acquire.cold+0x2d7/0x2fe
[ 0.206496] ? stack_trace_save+0x46/0x70
[ 0.206497] lock_acquire+0xd1/0x2f0
[ 0.206499] ? serial8250_console_write+0x327/0x4a0
[ 0.206500] ? __lock_acquire+0x5c7/0x2720
[ 0.206502] _raw_spin_lock_irqsave+0x3d/0x90
[ 0.206504] ? serial8250_console_write+0x327/0x4a0
[ 0.206506] serial8250_console_write+0x327/0x4a0
[ 0.206508] console_emit_next_record.constprop.0+0x180/0x330
[ 0.206511] console_unlock+0xf7/0x1f0
[ 0.206512] vprintk_emit+0xf7/0x330
[ 0.206514] _printk+0x63/0x7e
[ 0.206516] cblist_init_generic.constprop.0.cold+0x24/0x32
[ 0.206518] rcu_init_tasks_generic+0x5/0xd9
[ 0.206522] kernel_init_freeable+0x15b/0x2a2
[ 0.206523] ? rest_init+0x160/0x160
[ 0.206526] kernel_init+0x11/0x120
[ 0.206527] ret_from_fork+0x1f/0x30
[ 0.206530] </TASK>
[ 0.207018] cblist_init_generic: Setting shift to 1 and lim to 1.

This patch moves pr_info() so that it is called without
rtp->cbs_gbl_lock locked.

Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff 5fc8cbe4 Tue Aug 02 10:22:05 MDT 2022 Shigeru Yoshida <syoshida@redhat.com> rcu-tasks: Avoid pr_info() with spin lock in cblist_init_generic()

pr_info() is called with rtp->cbs_gbl_lock spin lock locked. Because
pr_info() calls printk() that might sleep, this will result in BUG
like below:

[ 0.206455] cblist_init_generic: Setting adjustable number of callback queues.
[ 0.206463]
[ 0.206464] =============================
[ 0.206464] [ BUG: Invalid wait context ]
[ 0.206465] 5.19.0-00428-g9de1f9c8ca51 #5 Not tainted
[ 0.206466] -----------------------------
[ 0.206466] swapper/0/1 is trying to lock:
[ 0.206467] ffffffffa0167a58 (&port_lock_key){....}-{3:3}, at: serial8250_console_write+0x327/0x4a0
[ 0.206473] other info that might help us debug this:
[ 0.206473] context-{5:5}
[ 0.206474] 3 locks held by swapper/0/1:
[ 0.206474] #0: ffffffff9eb597e0 (rcu_tasks.cbs_gbl_lock){....}-{2:2}, at: cblist_init_generic.constprop.0+0x14/0x1f0
[ 0.206478] #1: ffffffff9eb579c0 (console_lock){+.+.}-{0:0}, at: _printk+0x63/0x7e
[ 0.206482] #2: ffffffff9ea77780 (console_owner){....}-{0:0}, at: console_emit_next_record.constprop.0+0x111/0x330
[ 0.206485] stack backtrace:
[ 0.206486] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.19.0-00428-g9de1f9c8ca51 #5
[ 0.206488] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
[ 0.206489] Call Trace:
[ 0.206490] <TASK>
[ 0.206491] dump_stack_lvl+0x6a/0x9f
[ 0.206493] __lock_acquire.cold+0x2d7/0x2fe
[ 0.206496] ? stack_trace_save+0x46/0x70
[ 0.206497] lock_acquire+0xd1/0x2f0
[ 0.206499] ? serial8250_console_write+0x327/0x4a0
[ 0.206500] ? __lock_acquire+0x5c7/0x2720
[ 0.206502] _raw_spin_lock_irqsave+0x3d/0x90
[ 0.206504] ? serial8250_console_write+0x327/0x4a0
[ 0.206506] serial8250_console_write+0x327/0x4a0
[ 0.206508] console_emit_next_record.constprop.0+0x180/0x330
[ 0.206511] console_unlock+0xf7/0x1f0
[ 0.206512] vprintk_emit+0xf7/0x330
[ 0.206514] _printk+0x63/0x7e
[ 0.206516] cblist_init_generic.constprop.0.cold+0x24/0x32
[ 0.206518] rcu_init_tasks_generic+0x5/0xd9
[ 0.206522] kernel_init_freeable+0x15b/0x2a2
[ 0.206523] ? rest_init+0x160/0x160
[ 0.206526] kernel_init+0x11/0x120
[ 0.206527] ret_from_fork+0x1f/0x30
[ 0.206530] </TASK>
[ 0.207018] cblist_init_generic: Setting shift to 1 and lim to 1.

This patch moves pr_info() so that it is called without
rtp->cbs_gbl_lock locked.

Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
/linux-master/include/linux/
H A Drcupdate.hdiff 1a77557d Tue Mar 19 14:44:34 MDT 2024 Yan Zhai <yan@cloudflare.com> rcu: add a helper to report consolidated flavor QS

When under heavy load, network processing can run CPU-bound for many
tens of seconds. Even in preemptible kernels (non-RT kernel), this can
block RCU Tasks grace periods, which can cause trace-event removal to
take more than a minute, which is unacceptably long.

This commit therefore creates a new helper function that passes through
both RCU and RCU-Tasks quiescent states every 100 milliseconds. This
hard-coded value suffices for current workloads.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/90431d46ee112d2b0af04dbfe936faaca11810a5.1710877680.git.yan@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 28319d6d Fri Nov 25 06:55:00 MST 2022 Frederic Weisbecker <frederic@kernel.org> rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()

RCU Tasks and PID-namespace unshare can interact in do_exit() in a
complicated circular dependency:

1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace
that every subsequent child of TASK A will belong to. But TASK A
doesn't itself belong to that new PID namespace.

2) TASK A forks() and creates TASK B. TASK A stays attached to its PID
namespace (let's say PID_NS1) and TASK B is the first task belonging
to the new PID namespace created by unshare() (let's call it PID_NS2).

3) Since TASK B is the first task attached to PID_NS2, it becomes the
PID_NS2 child reaper.

4) TASK A forks() again and creates TASK C which get attached to PID_NS2.
Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has
TASK B (belonging to PID_NS2) as a pid_namespace child_reaper.

5) TASK B exits and since it is the child reaper for PID_NS2, it has to
kill all other tasks attached to PID_NS2, and wait for all of them to
die before getting reaped itself (zap_pid_ns_process()).

6) TASK A calls synchronize_rcu_tasks() which leads to
synchronize_srcu(&tasks_rcu_exit_srcu).

7) TASK B is waiting for TASK C to get reaped. But TASK B is under a
tasks_rcu_exit_srcu SRCU critical section (exit_notify() is between
exit_tasks_rcu_start() and exit_tasks_rcu_finish()), blocking TASK A.

8) TASK C exits and since TASK A is its parent, it waits for it to reap
TASK C, but it can't because TASK A waits for TASK B that waits for
TASK C.

Pid_namespace semantics can hardly be changed at this point. But the
coverage of tasks_rcu_exit_srcu can be reduced instead.

The current task is assumed not to be concurrently reapable at this
stage of exit_notify() and therefore tasks_rcu_exit_srcu can be
temporarily relaxed without breaking its constraints, providing a way
out of the deadlock scenario.

[ paulmck: Fix build failure by adding additional declaration. ]

Fixes: 3f95aa81d265 ("rcu: Make TASKS_RCU handle tasks that are almost done exiting")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Suggested-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Eric W . Biederman <ebiederm@xmission.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff 414c1238 Wed Apr 13 16:17:25 MDT 2022 Paul E. McKenney <paulmck@kernel.org> rcu: Provide a get_completed_synchronize_rcu() function

It is currently up to the caller to handle stale return values from
get_state_synchronize_rcu(). If poll_state_synchronize_rcu() returned
true once, a grace period has elapsed, regardless of the fact that counter
wrap might cause some future poll_state_synchronize_rcu() invocation to
return false. For example, the caller might store a separate flag that
indicates whether some previous call to poll_state_synchronize_rcu()
determined that the relevant grace period had already ended.

This approach works, but it requires extra storage and is easy to get
wrong. This commit therefore introduces a get_completed_synchronize_rcu()
that returns a cookie that causes poll_state_synchronize_rcu() to always
return true. This already-completed cookie can be stored in place of the
cookie that previously caused poll_state_synchronize_rcu() to return true.
It can also be used to flag a given structure as not having been exposed
to readers, and thus not requiring a grace period to elapse.

This commit is in preparation for polled expedited grace periods.

Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff 414c1238 Wed Apr 13 16:17:25 MDT 2022 Paul E. McKenney <paulmck@kernel.org> rcu: Provide a get_completed_synchronize_rcu() function

It is currently up to the caller to handle stale return values from
get_state_synchronize_rcu(). If poll_state_synchronize_rcu() returned
true once, a grace period has elapsed, regardless of the fact that counter
wrap might cause some future poll_state_synchronize_rcu() invocation to
return false. For example, the caller might store a separate flag that
indicates whether some previous call to poll_state_synchronize_rcu()
determined that the relevant grace period had already ended.

This approach works, but it requires extra storage and is easy to get
wrong. This commit therefore introduces a get_completed_synchronize_rcu()
that returns a cookie that causes poll_state_synchronize_rcu() to always
return true. This already-completed cookie can be stored in place of the
cookie that previously caused poll_state_synchronize_rcu() to return true.
It can also be used to flag a given structure as not having been exposed
to readers, and thus not requiring a grace period to elapse.

This commit is in preparation for polled expedited grace periods.

Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff 30668200 Mon Apr 05 10:51:05 MDT 2021 Paul E. McKenney <paulmck@kernel.org> rcu: Reject RCU_LOCKDEP_WARN() false positives

If another lockdep report runs concurrently with an RCU lockdep report
from RCU_LOCKDEP_WARN(), the following sequence of events can occur:

1. debug_lockdep_rcu_enabled() sees that lockdep is enabled
when called from (say) synchronize_rcu().

2. Lockdep is disabled by a concurrent lockdep report.

3. debug_lockdep_rcu_enabled() evaluates its lockdep-expression
argument, for example, lock_is_held(&rcu_bh_lock_map).

4. Because lockdep is now disabled, lock_is_held() plays it safe and
returns the constant 1.

5. But in this case, the constant 1 is not safe, because invoking
synchronize_rcu() under rcu_read_lock_bh() is disallowed.

6. debug_lockdep_rcu_enabled() wrongly invokes lockdep_rcu_suspicious(),
resulting in a false-positive splat.

This commit therefore changes RCU_LOCKDEP_WARN() to check
debug_lockdep_rcu_enabled() after checking the lockdep expression,
so that any "safe" returns from lock_is_held() are rejected by
debug_lockdep_rcu_enabled(). This requires memory ordering, which is
supplied by READ_ONCE(debug_locks). The resulting volatile accesses
prevent the compiler from reordering and the fact that only one variable
is being accessed prevents the underlying hardware from reordering.
The combination works for IA64, which can reorder reads to the same
location, but this is defeated by the volatile accesses, which compile
to load instructions that provide ordering.

Reported-by: syzbot+dde0cc33951735441301@syzkaller.appspotmail.com
Reported-by: Matthew Wilcox <willy@infradead.org>
Reported-by: syzbot+88e4f02896967fe1ab0d@syzkaller.appspotmail.com
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff 30668200 Mon Apr 05 10:51:05 MDT 2021 Paul E. McKenney <paulmck@kernel.org> rcu: Reject RCU_LOCKDEP_WARN() false positives

If another lockdep report runs concurrently with an RCU lockdep report
from RCU_LOCKDEP_WARN(), the following sequence of events can occur:

1. debug_lockdep_rcu_enabled() sees that lockdep is enabled
when called from (say) synchronize_rcu().

2. Lockdep is disabled by a concurrent lockdep report.

3. debug_lockdep_rcu_enabled() evaluates its lockdep-expression
argument, for example, lock_is_held(&rcu_bh_lock_map).

4. Because lockdep is now disabled, lock_is_held() plays it safe and
returns the constant 1.

5. But in this case, the constant 1 is not safe, because invoking
synchronize_rcu() under rcu_read_lock_bh() is disallowed.

6. debug_lockdep_rcu_enabled() wrongly invokes lockdep_rcu_suspicious(),
resulting in a false-positive splat.

This commit therefore changes RCU_LOCKDEP_WARN() to check
debug_lockdep_rcu_enabled() after checking the lockdep expression,
so that any "safe" returns from lock_is_held() are rejected by
debug_lockdep_rcu_enabled(). This requires memory ordering, which is
supplied by READ_ONCE(debug_locks). The resulting volatile accesses
prevent the compiler from reordering and the fact that only one variable
is being accessed prevents the underlying hardware from reordering.
The combination works for IA64, which can reorder reads to the same
location, but this is defeated by the volatile accesses, which compile
to load instructions that provide ordering.

Reported-by: syzbot+dde0cc33951735441301@syzkaller.appspotmail.com
Reported-by: Matthew Wilcox <willy@infradead.org>
Reported-by: syzbot+88e4f02896967fe1ab0d@syzkaller.appspotmail.com
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff 30668200 Mon Apr 05 10:51:05 MDT 2021 Paul E. McKenney <paulmck@kernel.org> rcu: Reject RCU_LOCKDEP_WARN() false positives

If another lockdep report runs concurrently with an RCU lockdep report
from RCU_LOCKDEP_WARN(), the following sequence of events can occur:

1. debug_lockdep_rcu_enabled() sees that lockdep is enabled
when called from (say) synchronize_rcu().

2. Lockdep is disabled by a concurrent lockdep report.

3. debug_lockdep_rcu_enabled() evaluates its lockdep-expression
argument, for example, lock_is_held(&rcu_bh_lock_map).

4. Because lockdep is now disabled, lock_is_held() plays it safe and
returns the constant 1.

5. But in this case, the constant 1 is not safe, because invoking
synchronize_rcu() under rcu_read_lock_bh() is disallowed.

6. debug_lockdep_rcu_enabled() wrongly invokes lockdep_rcu_suspicious(),
resulting in a false-positive splat.

This commit therefore changes RCU_LOCKDEP_WARN() to check
debug_lockdep_rcu_enabled() after checking the lockdep expression,
so that any "safe" returns from lock_is_held() are rejected by
debug_lockdep_rcu_enabled(). This requires memory ordering, which is
supplied by READ_ONCE(debug_locks). The resulting volatile accesses
prevent the compiler from reordering and the fact that only one variable
is being accessed prevents the underlying hardware from reordering.
The combination works for IA64, which can reorder reads to the same
location, but this is defeated by the volatile accesses, which compile
to load instructions that provide ordering.

Reported-by: syzbot+dde0cc33951735441301@syzkaller.appspotmail.com
Reported-by: Matthew Wilcox <willy@infradead.org>
Reported-by: syzbot+88e4f02896967fe1ab0d@syzkaller.appspotmail.com
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff 1b04fa99 Wed Dec 09 13:27:31 MST 2020 Uladzislau Rezki (Sony) <urezki@gmail.com> rcu-tasks: Move RCU-tasks initialization to before early_initcall()

PowerPC testing encountered boot failures due to RCU Tasks not being
fully initialized until core_initcall() time. This commit therefore
initializes RCU Tasks (along with Rude RCU and RCU Tasks Trace) just
before early_initcall() time, thus allowing waiting on RCU Tasks grace
periods from early_initcall() handlers.

Link: https://lore.kernel.org/rcu/87eekfh80a.fsf@dja-thinkpad.axtens.net/
Fixes: 36dadef23fcc ("kprobes: Init kprobes in early_initcall")
Tested-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff 1b04fa99 Wed Dec 09 13:27:31 MST 2020 Uladzislau Rezki (Sony) <urezki@gmail.com> rcu-tasks: Move RCU-tasks initialization to before early_initcall()

PowerPC testing encountered boot failures due to RCU Tasks not being
fully initialized until core_initcall() time. This commit therefore
initializes RCU Tasks (along with Rude RCU and RCU Tasks Trace) just
before early_initcall() time, thus allowing waiting on RCU Tasks grace
periods from early_initcall() handlers.

Link: https://lore.kernel.org/rcu/87eekfh80a.fsf@dja-thinkpad.axtens.net/
Fixes: 36dadef23fcc ("kprobes: Init kprobes in early_initcall")
Tested-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff c408b215 Mon May 25 15:47:55 MDT 2020 Uladzislau Rezki (Sony) <urezki@gmail.com> rcu: Rename *_kfree_callback/*_kfree_rcu_offset/kfree_call_*

The following changes are introduced:

1. Rename rcu_invoke_kfree_callback() to rcu_invoke_kvfree_callback(),
as well as the associated trace events, so the rcu_kfree_callback(),
becomes rcu_kvfree_callback(). The reason is to be aligned with kvfree()
notation.

2. Rename __is_kfree_rcu_offset to __is_kvfree_rcu_offset. All RCU
paths use kvfree() now instead of kfree(), thus rename it.

3. Rename kfree_call_rcu() to the kvfree_call_rcu(). The reason is,
it is capable of freeing vmalloc() memory now. Do the same with
__kfree_rcu() macro, it becomes __kvfree_rcu(), the goal is the
same.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
/linux-master/init/
H A Dmain.cdiff 46dad3c1 Fri Apr 12 02:17:32 MDT 2024 Yuntao Wang <ytcoode@gmail.com> init/main.c: Fix potential static_command_line memory overflow

We allocate memory of size 'xlen + strlen(boot_command_line) + 1' for
static_command_line, but the strings copied into static_command_line are
extra_command_line and command_line, rather than extra_command_line and
boot_command_line.

When strlen(command_line) > strlen(boot_command_line), static_command_line
will overflow.

This patch just recovers strlen(command_line) which was miss-consolidated
with strlen(boot_command_line) in the commit f5c7310ac73e ("init/main: add
checks for the return value of memblock_alloc*()")

Link: https://lore.kernel.org/all/20240412081733.35925-2-ytcoode@gmail.com/

Fixes: f5c7310ac73e ("init/main: add checks for the return value of memblock_alloc*()")
Cc: stable@vger.kernel.org
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
diff 8f8cd6c0 Mon Feb 26 19:35:46 MST 2024 Changbin Du <changbin.du@huawei.com> modules: wait do_free_init correctly

The synchronization here is to ensure the ordering of freeing of a module
init so that it happens before W+X checking. It is worth noting it is not
that the freeing was not happening, it is just that our sanity checkers
raced against the permission checkers which assume init memory is already
gone.

Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved calling
do_free_init() into a global workqueue instead of relying on it being
called through call_rcu(..., do_free_init), which used to allowed us call
do_free_init() asynchronously after the end of a subsequent grace period.
The move to a global workqueue broke the gaurantees for code which needed
to be sure the do_free_init() would complete with rcu_barrier(). To fix
this callers which used to rely on rcu_barrier() must now instead use
flush_work(&init_free_wq).

Without this fix, we still could encounter false positive reports in W+X
checking since the rcu_barrier() here can not ensure the ordering now.

Even worse, the rcu_barrier() can introduce significant delay. Eric
Chanudet reported that the rcu_barrier introduces ~0.1s delay on a
PREEMPT_RT kernel.

[ 0.291444] Freeing unused kernel memory: 5568K
[ 0.402442] Run /sbin/init as init process

With this fix, the above delay can be eliminated.

Link: https://lkml.kernel.org/r/20240227023546.2490667-1-changbin.du@huawei.com
Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
Signed-off-by: Changbin Du <changbin.du@huawei.com>
Tested-by: Eric Chanudet <echanude@redhat.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Xiaoyi Su <suxiaoyi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 8f8cd6c0 Mon Feb 26 19:35:46 MST 2024 Changbin Du <changbin.du@huawei.com> modules: wait do_free_init correctly

The synchronization here is to ensure the ordering of freeing of a module
init so that it happens before W+X checking. It is worth noting it is not
that the freeing was not happening, it is just that our sanity checkers
raced against the permission checkers which assume init memory is already
gone.

Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved calling
do_free_init() into a global workqueue instead of relying on it being
called through call_rcu(..., do_free_init), which used to allowed us call
do_free_init() asynchronously after the end of a subsequent grace period.
The move to a global workqueue broke the gaurantees for code which needed
to be sure the do_free_init() would complete with rcu_barrier(). To fix
this callers which used to rely on rcu_barrier() must now instead use
flush_work(&init_free_wq).

Without this fix, we still could encounter false positive reports in W+X
checking since the rcu_barrier() here can not ensure the ordering now.

Even worse, the rcu_barrier() can introduce significant delay. Eric
Chanudet reported that the rcu_barrier introduces ~0.1s delay on a
PREEMPT_RT kernel.

[ 0.291444] Freeing unused kernel memory: 5568K
[ 0.402442] Run /sbin/init as init process

With this fix, the above delay can be eliminated.

Link: https://lkml.kernel.org/r/20240227023546.2490667-1-changbin.du@huawei.com
Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
Signed-off-by: Changbin Du <changbin.du@huawei.com>
Tested-by: Eric Chanudet <echanude@redhat.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Xiaoyi Su <suxiaoyi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 8f8cd6c0 Mon Feb 26 19:35:46 MST 2024 Changbin Du <changbin.du@huawei.com> modules: wait do_free_init correctly

The synchronization here is to ensure the ordering of freeing of a module
init so that it happens before W+X checking. It is worth noting it is not
that the freeing was not happening, it is just that our sanity checkers
raced against the permission checkers which assume init memory is already
gone.

Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved calling
do_free_init() into a global workqueue instead of relying on it being
called through call_rcu(..., do_free_init), which used to allowed us call
do_free_init() asynchronously after the end of a subsequent grace period.
The move to a global workqueue broke the gaurantees for code which needed
to be sure the do_free_init() would complete with rcu_barrier(). To fix
this callers which used to rely on rcu_barrier() must now instead use
flush_work(&init_free_wq).

Without this fix, we still could encounter false positive reports in W+X
checking since the rcu_barrier() here can not ensure the ordering now.

Even worse, the rcu_barrier() can introduce significant delay. Eric
Chanudet reported that the rcu_barrier introduces ~0.1s delay on a
PREEMPT_RT kernel.

[ 0.291444] Freeing unused kernel memory: 5568K
[ 0.402442] Run /sbin/init as init process

With this fix, the above delay can be eliminated.

Link: https://lkml.kernel.org/r/20240227023546.2490667-1-changbin.du@huawei.com
Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
Signed-off-by: Changbin Du <changbin.du@huawei.com>
Tested-by: Eric Chanudet <echanude@redhat.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Xiaoyi Su <suxiaoyi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff ad1a4830 Wed May 17 07:10:59 MDT 2023 Arnd Bergmann <arnd@arndb.de> init: consolidate prototypes in linux/init.h

The init/main.c file contains some extern declarations for functions
defined in architecture code, and it defines some other functions that are
called from architecture code with a custom prototype. Both of those
result in warnings with 'make W=1':

init/calibrate.c:261:37: error: no previous prototype for 'calibrate_delay_is_known' [-Werror=missing-prototypes]
init/main.c:790:20: error: no previous prototype for 'mem_encrypt_init' [-Werror=missing-prototypes]
init/main.c:792:20: error: no previous prototype for 'poking_init' [-Werror=missing-prototypes]
arch/arm64/kernel/irq.c:122:13: error: no previous prototype for 'init_IRQ' [-Werror=missing-prototypes]
arch/arm64/kernel/time.c:55:13: error: no previous prototype for 'time_init' [-Werror=missing-prototypes]
arch/x86/kernel/process.c:935:13: error: no previous prototype for 'arch_post_acpi_subsys_init' [-Werror=missing-prototypes]
init/calibrate.c:261:37: error: no previous prototype for 'calibrate_delay_is_known' [-Werror=missing-prototypes]
kernel/fork.c:991:20: error: no previous prototype for 'arch_task_cache_init' [-Werror=missing-prototypes]

Add prototypes for all of these in include/linux/init.h or another
appropriate header, and remove the duplicate declarations from
architecture specific code.

[sfr@canb.auug.org.au: declare time_init_early()]
Link: https://lkml.kernel.org/r/20230519124311.5167221c@canb.auug.org.au
Link: https://lkml.kernel.org/r/20230517131102.934196-12-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 514ca14e Mon Apr 17 16:00:05 MDT 2023 ndesaulniers@google.com <ndesaulniers@google.com> start_kernel: Add __no_stack_protector function attribute

Back during the discussion of
commit a9a3ed1eff36 ("x86: Fix early boot crash on gcc-10, third try")
we discussed the need for a function attribute to control the omission
of stack protectors on a per-function basis; at the time Clang had
support for no_stack_protector but GCC did not. This was fixed in
gcc-11. Now that the function attribute is available, let's start using
it.

Callers of boot_init_stack_canary need to use this function attribute
unless they're compiled with -fno-stack-protector, otherwise the canary
stored in the stack slot of the caller will differ upon the call to
boot_init_stack_canary. This will lead to a call to __stack_chk_fail()
then panic.

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94722
Link: https://lore.kernel.org/all/20200316130414.GC12561@hirez.programming.kicks-ass.net/
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20230412-no_stackp-v2-1-116f9fe4bbe7@google.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>

Signed-off-by: ndesaulniers@google.com <ndesaulniers@google.com>
diff b743852c Tue Feb 21 16:27:48 MST 2023 Paul E. McKenney <paulmck@kernel.org> Allow forcing unconditional bootconfig processing

The BOOT_CONFIG family of Kconfig options allows a bootconfig file
containing kernel boot parameters to be embedded into an initrd or into
the kernel itself. This can be extremely useful when deploying kernels
in cases where some of the boot parameters depend on the kernel version
rather than on the server hardware, firmware, or workload.

Unfortunately, the "bootconfig" kernel parameter must be specified in
order to cause the kernel to look for the embedded bootconfig file,
and it clearly does not help to embed this "bootconfig" kernel parameter
into that file.

Therefore, provide a new BOOT_CONFIG_FORCE Kconfig option that causes the
kernel to act as if the "bootconfig" kernel parameter had been specified.
In other words, kernels built with CONFIG_BOOT_CONFIG_FORCE=y will look
for the embedded bootconfig file even when the "bootconfig" kernel
parameter is omitted. This permits kernel-version-dependent kernel
boot parameters to be embedded into the kernel image without the need to
(for example) update large numbers of boot loaders.

Link: https://lore.kernel.org/all/20230105005838.GA1772817@paulmck-ThinkPad-P17-Gen-1/

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <linux-doc@vger.kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
diff 7ec7096b Tue Jan 17 13:46:17 MST 2023 Pasha Tatashin <pasha.tatashin@soleen.com> mm/page_ext: init page_ext early if there are no deferred struct pages

page_ext must be initialized after all struct pages are initialized.
Therefore, page_ext is initialized after page_alloc_init_late(), and can
optionally be initialized earlier via early_page_ext kernel parameter
which as a side effect also disables deferred struct pages.

Allow to automatically init page_ext early when there are no deferred
struct pages in order to be able to use page_ext during kernel boot and
track for example page allocations early.

[pasha.tatashin@soleen.com: fix build with CONFIG_PAGE_EXTENSION=n]
Link: https://lkml.kernel.org/r/20230118155251.2522985-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20230117204617.1553748-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 7ec7096b Tue Jan 17 13:46:17 MST 2023 Pasha Tatashin <pasha.tatashin@soleen.com> mm/page_ext: init page_ext early if there are no deferred struct pages

page_ext must be initialized after all struct pages are initialized.
Therefore, page_ext is initialized after page_alloc_init_late(), and can
optionally be initialized earlier via early_page_ext kernel parameter
which as a side effect also disables deferred struct pages.

Allow to automatically init page_ext early when there are no deferred
struct pages in order to be able to use page_ext during kernel boot and
track for example page allocations early.

[pasha.tatashin@soleen.com: fix build with CONFIG_PAGE_EXTENSION=n]
Link: https://lkml.kernel.org/r/20230118155251.2522985-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20230117204617.1553748-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 941baf6f Thu Sep 08 12:21:54 MDT 2022 Alexey Dobriyan <adobriyan@gmail.com> proc: give /proc/cmdline size

Most /proc files don't have length (in fstat sense). This leads to
inefficiencies when reading such files with APIs commonly found in modern
programming languages. They open file, then fstat descriptor, get st_size
== 0 and either assume file is empty or start reading without knowing
target size.

cat(1) does OK because it uses large enough buffer by default. But naive
programs copy-pasted from SO aren't:

let mut f = std::fs::File::open("/proc/cmdline").unwrap();
let mut buf: Vec<u8> = Vec::new();
f.read_to_end(&mut buf).unwrap();

will result in

openat(AT_FDCWD, "/proc/cmdline", O_RDONLY|O_CLOEXEC) = 3
statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address)
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0444, stx_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 0
read(3, "BOOT_IMAGE=(hd3,gpt2)/vmlinuz-5.", 32) = 32
read(3, "19.6-100.fc35.x86_64 root=/dev/m", 32) = 32
read(3, "apper/fedora_localhost--live-roo"..., 64) = 64
read(3, "ocalhost--live-swap rd.lvm.lv=fe"..., 128) = 116
read(3, "", 12)

open/stat is OK, lseek looks silly but there are 3 unnecessary reads
because Rust starts with 32 bytes per Vec<u8> and grows from there.

In case of /proc/cmdline, the length is known precisely.

Make variables readonly while I'm at it.

P.S.: I tried to scp /proc/cpuinfo today and got empty file
but this is separate story.

Link: https://lkml.kernel.org/r/YxoywlbM73JJN3r+@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 941baf6f Thu Sep 08 12:21:54 MDT 2022 Alexey Dobriyan <adobriyan@gmail.com> proc: give /proc/cmdline size

Most /proc files don't have length (in fstat sense). This leads to
inefficiencies when reading such files with APIs commonly found in modern
programming languages. They open file, then fstat descriptor, get st_size
== 0 and either assume file is empty or start reading without knowing
target size.

cat(1) does OK because it uses large enough buffer by default. But naive
programs copy-pasted from SO aren't:

let mut f = std::fs::File::open("/proc/cmdline").unwrap();
let mut buf: Vec<u8> = Vec::new();
f.read_to_end(&mut buf).unwrap();

will result in

openat(AT_FDCWD, "/proc/cmdline", O_RDONLY|O_CLOEXEC) = 3
statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address)
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0444, stx_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 0
read(3, "BOOT_IMAGE=(hd3,gpt2)/vmlinuz-5.", 32) = 32
read(3, "19.6-100.fc35.x86_64 root=/dev/m", 32) = 32
read(3, "apper/fedora_localhost--live-roo"..., 64) = 64
read(3, "ocalhost--live-swap rd.lvm.lv=fe"..., 128) = 116
read(3, "", 12)

open/stat is OK, lseek looks silly but there are 3 unnecessary reads
because Rust starts with 32 bytes per Vec<u8> and grows from there.

In case of /proc/cmdline, the length is known precisely.

Make variables readonly while I'm at it.

P.S.: I tried to scp /proc/cpuinfo today and got empty file
but this is separate story.

Link: https://lkml.kernel.org/r/YxoywlbM73JJN3r+@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Completed in 410 milliseconds