History log of /linux-master/kernel/time/timer_migration.c
Revision Date Author Comments
# d7ad05c8 05-May-2024 Levi Yun <ppbuk5246@gmail.com>

timers/migration: Prevent out of bounds access on failure

When tmigr_setup_groups() fails the level 0 group allocation, then the
cleanup dereferences index -1 of the local stack array.

Prevent this by checking the loop condition first.
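
Purely for illustration, a minimal user-space sketch of the cleanup pattern
in question follows; the array size and the alloc_group() helper are
invented and this is not the kernel's tmigr_setup_groups() code:

#include <stdlib.h>

#define NR_LEVELS 3

static void *alloc_group(int lvl)
{
        /* Simulate the level 0 allocation failure. */
        return lvl == 0 ? NULL : malloc(16);
}

static int setup_groups(void)
{
        void *stack[NR_LEVELS];
        int i;

        for (i = 0; i < NR_LEVELS; i++) {
                stack[i] = alloc_group(i);
                if (!stack[i])
                        goto err;
        }
        return 0;

err:
        /*
         * A "do { free(stack[--i]); } while (i);" style cleanup reads
         * stack[-1] when the failure already happened at i == 0.
         * Checking the loop condition first avoids the out of bounds
         * access.
         */
        while (i > 0)
                free(stack[--i]);
        return -1;
}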

Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Signed-off-by: Levi Yun <ppbuk5246@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/20240506041059.86877-1-ppbuk5246@gmail.com


# 7a96a84b 05-Apr-2024 Anna-Maria Behnsen <anna-maria@linutronix.de>

timers/migration: Return early on deactivation

Commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
deactivation") removed the logic to return early in tmigr_update_events()
on deactivation. This fixed the problem of the first global event not being
properly updated in a hierarchy containing only a single group.

But a closer look at this code path for a hierarchy with more than a single
level shows that unnecessary work is now done (the example is partially
copied from the message of the commit mentioned above):

                             [GRP1:0]
                          migrator = GRP0:0
                          active   = GRP0:0
                          nextevt  = T0:0i, T0:1
                           /              \
                [GRP0:0]                  [GRP0:1]
             migrator = 0              migrator = NONE
             active   = 0              active   = NONE
             nextevt  = T0i, T1        nextevt  = T2
               /         \                /      \
              0 (T0i)     1 (T1)         2 (T2)   3
            active        idle           idle     idle

0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2

                             [GRP1:0]
                          migrator = GRP0:0
                          active   = GRP0:0
                          nextevt  = T0:0i, T0:1
                           /              \
                [GRP0:0]                  [GRP0:1]
             migrator = NONE           migrator = NONE
             active   = NONE           active   = NONE
             nextevt  = T1             nextevt  = T2
               /         \                /      \
              0 (T0i)     1 (T1)         2 (T2)   3
             idle         idle           idle     idle

1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". Without this
early return the following steps happen in tmigr_update_events() when
child = null and group = GRP0:0 :

lock(GRP0:0->lock);
timerqueue_del(GRP0:0, T0i);
unlock(GRP0:0->lock);


                             [GRP1:0]
                          migrator = NONE
                          active   = NONE
                          nextevt  = T0:0, T0:1
                           /              \
                [GRP0:0]                  [GRP0:1]
             migrator = NONE           migrator = NONE
             active   = NONE           active   = NONE
             nextevt  = T1             nextevt  = T2
               /         \                /      \
              0 (T0i)     1 (T1)         2 (T2)   3
             idle         idle           idle     idle

2) The change now propagates up to the top. Then tmigr_update_events()
updates the group event of GRP0:0 and executes the following steps
(child = GRP0:0 and group = GRP0:0):

lock(GRP0:0->lock);
lock(GRP1:0->lock);
evt = tmigr_next_groupevt(GRP0:0); -> this removes the ignored events
in GRP0:0
... update GRP1:0 group event and timerqueue ...
unlock(GRP1:0->lock);
unlock(GRP0:0->lock);

So the dance in 1), locking GRP0:0->lock and removing T0i from the
timerqueue, is redundant as this is done anyway in 2) when
tmigr_next_groupevt(GRP0:0) is executed.

Revert commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
deactivation") and add a condition to the return path so that the early
return is skipped only when the hierarchy contains a single group. Adapt
comments accordingly.
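
To illustrate the shape of the restored check, here is a simplified sketch;
the structures and the update_events() helper are stand-ins for
illustration, not the kernel's tmigr_update_events() implementation:

#include <stdbool.h>

struct tgroup {
        struct tgroup *parent;  /* NULL when the hierarchy has one group */
        /* lock, timerqueue, group event, ... omitted */
};

struct tevent {
        bool ignore;            /* no global timer to propagate */
};

static bool update_events(struct tgroup *group, struct tevent *evt)
{
        /*
         * With a parent group present, the ignored event is removed
         * anyway once the walk reaches the parent level (via
         * tmigr_next_groupevt()), so locking and dequeueing here would
         * be redundant.  Only a single-group hierarchy has to do the
         * update right here, hence no early return in that case.
         */
        if (evt->ignore && group->parent)
                return true;

        /* ... lock the group, update its timerqueue and group event ... */
        return false;
}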

Fixes: 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on deactivation")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/87cyr49on2.fsf@somnus


# 61f7fdf8 01-Apr-2024 Frederic Weisbecker <frederic@kernel.org>

timers/migration: Fix ignored event due to missing CPU update

When a group event is updated with its expiry unchanged but a different
CPU, that target change may go unnoticed and the event may be propagated
up with a stale CPU value. The following depicts a scenario that has
actually been observed:

                             [GRP2:0]
                          migrator = GRP1:1
                          active   = GRP1:1
                          nextevt  = TGRP1:0 (T0)
                           /              \
                     [GRP1:0]           [GRP1:1]
                  migrator = NONE         [...]
                  active   = NONE
                  nextevt  = TGRP0:0 (T0)
                   /           \
             [GRP0:0]          [...]
          migrator = NONE
          active   = NONE
          nextevt  = T0
           /         \
          0 (T0)      1 (T1)
          idle        idle

0) The hierarchy has 3 levels. The left part (GRP1:0) is all idle,
including CPU 0 and CPU 1 which have a timer each: T0 and T1. They have
the same expiry value.

                             [GRP2:0]
                          migrator = GRP1:1
                          active   = GRP1:1
                          nextevt  = KTIME_MAX
                           /              \
                     [GRP1:0]           [GRP1:1]
                  migrator = NONE         [...]
                  active   = NONE
                  nextevt  = TGRP0:0 (T0)
                   /           \
             [GRP0:0]          [...]
          migrator = NONE
          active   = NONE
          nextevt  = T0
           /         \
          0 (T0)      1 (T1)
          idle        idle

1) The migrator in GRP1:1 handles remotely T0. The event is dequeued
from the top and T0 executed.

                             [GRP2:0]
                          migrator = GRP1:1
                          active   = GRP1:1
                          nextevt  = KTIME_MAX
                           /              \
                     [GRP1:0]           [GRP1:1]
                  migrator = NONE         [...]
                  active   = NONE
                  nextevt  = TGRP0:0 (T0)
                   /           \
             [GRP0:0]          [...]
          migrator = NONE
          active   = NONE
          nextevt  = T1
           /         \
          0            1 (T1)
          idle         idle

2) The migrator in GRP1:1 fetches the next timer for CPU 0 and finds
none. But it updates the events from its groups, starting with GRP0:0
which now has T1 as its next event. So far so good.

                             [GRP2:0]
                          migrator = GRP1:1
                          active   = GRP1:1
                          nextevt  = KTIME_MAX
                           /              \
                     [GRP1:0]           [GRP1:1]
                  migrator = NONE         [...]
                  active   = NONE
                  nextevt  = TGRP0:0 (T0)
                   /           \
             [GRP0:0]          [...]
          migrator = NONE
          active   = NONE
          nextevt  = T1
           /         \
          0            1 (T1)
          idle         idle

3) The migrator in GRP1:1 proceeds upward and updates the events in
GRP1:0. The child event TGRP0:0 is found queued with the same expiry
as before and is therefore left unchanged. However the target CPU is
not the same, but that fact is ignored, so TGRP0:0 still points to
CPU 0 when it should point to CPU 1.

                             [GRP2:0]
                          migrator = GRP1:1
                          active   = GRP1:1
                          nextevt  = TGRP1:0 (T0)
                           /              \
                     [GRP1:0]           [GRP1:1]
                  migrator = NONE         [...]
                  active   = NONE
                  nextevt  = TGRP0:0 (T0)
                   /           \
             [GRP0:0]          [...]
          migrator = NONE
          active   = NONE
          nextevt  = T1
           /         \
          0            1 (T1)
          idle         idle

4) The propagation has reached the top level and TGRP1:0, having TGRP0:0
as its first event, also wrongly points to CPU 0. TGRP1:0 is added to
the top level group.

                             [GRP2:0]
                          migrator = GRP1:1
                          active   = GRP1:1
                          nextevt  = KTIME_MAX
                           /              \
                     [GRP1:0]           [GRP1:1]
                  migrator = NONE         [...]
                  active   = NONE
                  nextevt  = TGRP0:0 (T0)
                   /           \
             [GRP0:0]          [...]
          migrator = NONE
          active   = NONE
          nextevt  = T1
           /         \
          0            1 (T1)
          idle         idle

5) The migrator in GRP1:1 dequeues the next event in the top level
pointing to CPU 0. But since it doesn't actually see any real event on
CPU 0, it early returns.

6) T1 is left unhandled until either CPU 0 or CPU 1 wake up.

Other bad scenarios may involve trees with just two levels.

Fix this by unconditionally updating the CPU of the child event before
considering the early return taken when a queued event is updated with an
unchanged expiry value.
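
A rough sketch of the idea, with simplified types and invented names (not
the actual tmigr event handling code):

#include <stdbool.h>
#include <stdint.h>

struct group_event {
        uint64_t expires;       /* expiry propagated to the parent */
        unsigned int cpu;       /* CPU this expiry belongs to */
};

static bool update_queued_event(struct group_event *evt,
                                uint64_t next_expiry, unsigned int next_cpu)
{
        /*
         * The target CPU may change even though the expiry stays the
         * same (two CPUs had timers with identical expiry and the first
         * one has already been handled remotely).  Record it before the
         * shortcut below, otherwise the event keeps pointing at a CPU
         * that no longer has a timer queued.
         */
        evt->cpu = next_cpu;

        /* Expiry unchanged: nothing further to propagate upward. */
        if (evt->expires == next_expiry)
                return true;

        evt->expires = next_expiry;
        /* ... requeue the event in the parent group's timerqueue ... */
        return false;
}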

Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/Zg2Ct6M2RJAYHgCB@localhost.localdomain


# f55acb1e 18-Mar-2024 Frederic Weisbecker <frederic@kernel.org>

timers/migration: Fix endless timer requeue after idle interrupts

When a CPU is an idle migrator, but another CPU wakes up before it,
becomes an active migrator and handles the queue, the initial idle
migrator may end up endlessly reprogramming its clockevent, chasing ghost
timers forever such as in the following scenario:

                   [GRP0:0]
                migrator = 0
                active   = 0
                nextevt  = T1
                 /         \
                0           1
             active      idle (T1)

0) CPU 1 is idle and has a timer queued (T1), CPU 0 is active and is
the active migrator.

                   [GRP0:0]
                migrator = NONE
                active   = NONE
                nextevt  = T1
                 /         \
                0           1
              idle       idle (T1)
          wakeup = T1

1) CPU 0 is now idle and is therefore the idle migrator. It has
programmed its next timer interrupt to handle T1.

                   [GRP0:0]
                migrator = 1
                active   = 1
                nextevt  = KTIME_MAX
                 /         \
                0           1
              idle        active
          wakeup = T1

2) CPU 1 has woken up, it is now active and it has just handled its own
timer T1.

3) CPU 0 gets a timer interrupt to handle T1 but tmigr_handle_remote()
realizes it is not the migrator anymore. So it early returns without
observing that T1 has already expired and therefore without updating
its ->wakeup value.

4) CPU 0 goes into tmigr_cpu_new_timer() which also early returns
because it doesn't queue a timer of its own. So ->wakeup is left
unchanged and the next timer is programmed to fire now.

5) goto 3) forever

This results in timer interrupt storms in idle and also in nohz_full (as
observed in rcutorture's TREE07 scenario).

Fix this by forcing a re-evaluation of tmc->wakeup while attempting remote
timer handling when the CPU isn't the migrator anymore. The check is
inherently racy, but in the worst case the CPU just races setting the
KTIME_MAX value that a remote expiry also tries to set.
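
A toy user-space model of that idea follows; the names are invented for
illustration and this is not the kernel's tmigr_handle_remote():

#include <stdbool.h>
#include <stdint.h>

#define KTIME_MAX INT64_MAX

struct tmc_model {
        int64_t wakeup;         /* next remote expiry this CPU armed for */
};

/* Placeholder for the real hierarchy walk deciding migrator duty. */
static bool is_migrator(struct tmc_model *tmc)
{
        return false;
}

static void handle_remote(struct tmc_model *tmc)
{
        if (!is_migrator(tmc)) {
                /*
                 * Another CPU became the migrator and may already have
                 * expired the timer this CPU armed for.  Reset the
                 * cached wakeup value instead of returning with the
                 * stale expiry; racing with a remote expiry is harmless
                 * since both end up storing KTIME_MAX.
                 */
                if (tmc->wakeup != KTIME_MAX)
                        tmc->wakeup = KTIME_MAX;
                return;
        }

        /* ... walk the hierarchy and expire remote timers ... */
}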

Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240318230729.15497-2-frederic@kernel.org


# 4b6f4c5a 14-Mar-2024 Frederic Weisbecker <frederic@kernel.org>

timer/migration: Remove buggy early return on deactivation

When a CPU enters into idle and deactivates itself from the timer
migration hierarchy without any global timer of its own to propagate,
the group event of that CPU is set to "ignore" and tmigr_update_events()
accordingly performs an early return without considering timers queued
by other CPUs.

If the hierarchy has a single level, and the CPU is the last one to
enter idle, it will ignore others' global timers, as in the following
layout:

                   [GRP0:0]
                migrator = 0
                active   = 0
                nextevt  = T0i
                 /         \
                0           1
          active (T0i)   idle (T1)

0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.

                   [GRP0:0]
                migrator = NONE
                active   = NONE
                nextevt  = T0i
                 /         \
                0           1
           idle (T0i)    idle (T1)

1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". As a result
tmigr_update_events() ignores T1 and CPU 0 goes to idle with T1
unhandled.

This isn't specific to a single level hierarchy though. A similar,
although slightly different, issue may arise with multiple levels:

                             [GRP1:0]
                          migrator = GRP0:0
                          active   = GRP0:0
                          nextevt  = T0:0i, T0:1
                           /              \
                [GRP0:0]                  [GRP0:1]
             migrator = 0              migrator = NONE
             active   = 0              active   = NONE
             nextevt  = T0i            nextevt  = T2
               /         \                /      \
              0 (T0i)     1 (T1)         2 (T2)   3
            active        idle           idle     idle

0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2

                             [GRP1:0]
                          migrator = GRP0:0
                          active   = GRP0:0
                          nextevt  = T0:0i, T0:1
                           /              \
                [GRP0:0]                  [GRP0:1]
             migrator = NONE           migrator = NONE
             active   = NONE           active   = NONE
             nextevt  = T0i            nextevt  = T2
               /         \                /      \
              0 (T0i)     1 (T1)         2 (T2)   3
             idle         idle           idle     idle

1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". As a result
tmigr_update_events() ignores T1. The change only propagated up to 1st
level so far.

                             [GRP1:0]
                          migrator = NONE
                          active   = NONE
                          nextevt  = T0:1
                           /              \
                [GRP0:0]                  [GRP0:1]
             migrator = NONE           migrator = NONE
             active   = NONE           active   = NONE
             nextevt  = T0i            nextevt  = T2
               /         \                /      \
              0 (T0i)     1 (T1)         2 (T2)   3
             idle         idle           idle     idle

2) The change now propagates up to the top. tmigr_update_events() finds
that the child event is ignored and thus removes it. The top level's next
event is now T2, which is returned to CPU 0 as the next effective expiry
to account for as the global idle migrator. However T1 has been ignored
along the way, leaving it unhandled.

Fix those issues by removing the buggy early return. Ignored child events
must not prevent the other events within the same group from being
evaluated.

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/ZfOhB9ZByTZcBy4u@lothringen


# 8ca18367 04-Mar-2024 Frederic Weisbecker <frederic@kernel.org>

timer/migration: Fix quick check reporting late expiry

When a CPU is the last active one in the hierarchy and tries to enter
idle, the quick check that looks up the next event for the cpuidle
heuristics may report a too late expiry, such as in the following
scenario:

                             [GRP1:0]
                          migrator = NONE
                          active   = NONE
                          nextevt  = T0:0, T0:1
                           /              \
                [GRP0:0]                  [GRP0:1]
             migrator = NONE           migrator = NONE
             active   = NONE           active   = NONE
             nextevt  = T0, T1         nextevt  = T2
               /         \                /      \
              0           1              2        3
             idle         idle           idle     idle

0) The whole system is idle, and CPU 0 was the last migrator. CPU 0 has
a timer (T0), CPU 1 has a timer (T1) and CPU 2 has a timer (T2). The
expire order is T0 < T1 < T2.

                             [GRP1:0]
                          migrator = GRP0:0
                          active   = GRP0:0
                          nextevt  = T0:0(i), T0:1
                           /              \
                [GRP0:0]                  [GRP0:1]
             migrator = CPU0           migrator = NONE
             active   = CPU0           active   = NONE
             nextevt  = T0(i), T1      nextevt  = T2
               /         \                /      \
              0           1              2        3
            active        idle           idle     idle

1) CPU 0 becomes active. The (i) means a now ignored timer.

                             [GRP1:0]
                          migrator = GRP0:0
                          active   = GRP0:0
                          nextevt  = T0:1
                           /              \
                [GRP0:0]                  [GRP0:1]
             migrator = CPU0           migrator = NONE
             active   = CPU0           active   = NONE
             nextevt  = T1             nextevt  = T2
               /         \                /      \
              0           1              2        3
            active        idle           idle     idle

2) CPU 0 handles remote timers. No timer actually expired, but ignored
timers have been cleaned out while their siblings' timers haven't been
propagated. As a result the top level's next event is T2 and not T1.

3) CPU 0 tries to enter idle without any global timer enqueued and calls
tmigr_quick_check(). The expiry of T2 is returned instead of the
expiry of T1.

When the quick check returns an expiry that is too late, the cpuidle
governor may pick a C-state that is too deep. This may result in undesired
CPU wakeup latency if the next timer is actually close enough.

Fix this by assuming that expiries aren't sorted top-down while performing
the quick check, and pick instead the earliest expiry encountered while
walking up the hierarchy.
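
The approach can be sketched as follows, with simplified structures that
are not the kernel's tmigr_quick_check():

#include <stdint.h>

#define KTIME_MAX INT64_MAX

struct tgroup {
        struct tgroup *parent;  /* NULL at the top level */
        int64_t next_expiry;    /* first event queued in this group */
};

static int64_t quick_check(struct tgroup *group)
{
        int64_t nextevt = KTIME_MAX;

        /*
         * An upper level may still hold a later event (T2 above) while
         * an earlier sibling timer (T1) was never propagated, so the
         * top level alone cannot be trusted.  Keep the minimum of all
         * expiries seen on the way up.
         */
        for (; group; group = group->parent) {
                if (group->next_expiry < nextevt)
                        nextevt = group->next_expiry;
        }

        return nextevt;
}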

Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240305002822.18130-1-frederic@kernel.org


# 36e40df3 22-Feb-2024 Anna-Maria Behnsen <anna-maria@linutronix.de>

timer_migration: Add tracepoints

The timer pull logic needs proper debugging aids. Add tracepoints so the
hierarchical idle machinery can be diagnosed.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240222103403.31923-1-anna-maria@linutronix.de


# 7ee98877 22-Feb-2024 Anna-Maria Behnsen <anna-maria@linutronix.de>

timers: Implement the hierarchical pull model

Placing timers at enqueue time on a target CPU based on dubious heuristics
does not make any sense:

1) Most timer wheel timers are canceled or rearmed before they expire.

2) The heuristics to predict which CPU will be busy when the timer expires
are wrong by definition.

So placing the timers at enqueue wastes precious cycles.

The proper solution to this problem is to always queue the timers on the
local CPU and allow the non pinned timers to be pulled onto a busy CPU at
expiry time.

Therefore split the timer storage into local pinned and global timers:
Local pinned timers are always expired on the CPU on which they have been
queued. Global timers can be expired on any CPU.

As long as a CPU is busy it expires both local and global timers. When a
CPU goes idle it arms for the first expiring local timer. If the first
expiring pinned (local) timer is before the first expiring movable timer,
then no action is required because the CPU will wake up before the first
movable timer expires. If the first expiring movable timer is before the
first expiring pinned (local) timer, then this timer is queued into an idle
timerqueue and eventually expired by another active CPU.
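
The idle-time decision can be sketched like this; the structure and the
propagate() callback are illustrative stand-ins, not the actual timer base
API:

#include <stdint.h>

struct cpu_timer_bases {
        int64_t next_local;     /* first expiring pinned (local) timer */
        int64_t next_global;    /* first expiring movable (global) timer */
};

/*
 * Returns the expiry the CPU arms its clockevent for when entering idle;
 * propagate() stands in for handing a movable timer to the hierarchy.
 */
static int64_t cpu_enters_idle(struct cpu_timer_bases *base,
                               void (*propagate)(int64_t expires))
{
        if (base->next_global < base->next_local) {
                /*
                 * The movable timer would fire first: hand it to the
                 * hierarchy so another active CPU can expire it
                 * remotely, and only arm for the pinned timer locally.
                 */
                propagate(base->next_global);
                return base->next_local;
        }

        /*
         * The pinned timer fires first anyway, so this CPU wakes up in
         * time to handle both kinds itself; no propagation is needed.
         */
        return base->next_local;
}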

To avoid global locking, the timerqueues are implemented as a hierarchy. The
lowest level of the hierarchy holds the CPUs. The CPUs are assigned to
groups of 8, which are separated per node. If more than one CPU group
exists, then a second level in the hierarchy collects the groups. Depending
on the size of the system, more than 2 levels may be required. Each group
has a "migrator" which checks the timerqueue during the tick for remotely
expirable timers.

If the last CPU in a group goes idle, it reports the first expiring event
in the group up to the next group(s) in the hierarchy. If the last CPU of
the whole hierarchy goes idle, it arms its own timer for the first
system-wide expiring timer to ensure that no timer event is missed.
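
A toy model of the group hierarchy described above; field names and helpers
are illustrative, not the kernel's struct tmigr_group:

#include <stdint.h>

#define KTIME_MAX       INT64_MAX
#define MAX_CHILDREN    8       /* CPUs (or groups) per group */

struct tgroup {
        struct tgroup *parent;  /* NULL at the top level */
        int parent_slot;        /* this group's slot in its parent */
        uint8_t active_mask;    /* which children are currently active */
        int migrator;           /* active child acting as migrator, -1 if none */
        int64_t next_event;     /* earliest global event queued below */
};

/*
 * Child @slot of @group goes idle and reports @expires as its earliest
 * global event.  Returns the expiry the last idle CPU has to arm for,
 * or KTIME_MAX while an active migrator remains somewhere.
 */
static int64_t tgroup_child_idle(struct tgroup *group, int slot, int64_t expires)
{
        group->active_mask &= ~(1U << slot);
        if (expires < group->next_event)
                group->next_event = expires;

        if (group->active_mask) {
                /* Another active child takes over migrator duty. */
                group->migrator = __builtin_ctz(group->active_mask);
                return KTIME_MAX;
        }

        /* Whole group idle: report the earliest event one level up. */
        group->migrator = -1;
        if (group->parent)
                return tgroup_child_idle(group->parent, group->parent_slot,
                                         group->next_event);

        /* Top level fully idle: the caller arms for the earliest event. */
        return group->next_event;
}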

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240222103710.32582-1-anna-maria@linutronix.de