History log of /linux-master/kernel/rcu/srcutree.c
Revision Date Author Comments
# 67050837 26-Dec-2023 Joel Fernandes (Google) <joel@joelfernandes.org>

srcu: Improve comments about acceleration leak

The comments added in commit 1ef990c4b36b ("srcu: No need to
advance/accelerate if no callback enqueued") are a bit confusing.
The comments are describing a scenario for code that was moved and is
no longer the way it was (snapshot after advancing). Improve the code
comments to reflect this and also document why acceleration can never
fail.

Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>


# c21357e4 03-Oct-2023 Frederic Weisbecker <frederic@kernel.org>

srcu: Explain why callbacks invocations can't run concurrently

If an SRCU barrier is queued while callbacks are running and a new
callbacks invocator for the same sdp were to run concurrently, the
RCU barrier might execute too early. As this requirement is non-obvious,
make sure to keep a record.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>


# 94c55b9e 03-Oct-2023 Frederic Weisbecker <frederic@kernel.org>

srcu: No need to advance/accelerate if no callback enqueued

While in grace period start, there is nothing to accelerate and
therefore no need to advance the callbacks either if no callback is
to be enqueued.

Spare these needless operations in this case.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>


# 20eb4142 03-Oct-2023 Frederic Weisbecker <frederic@kernel.org>

srcu: Remove superfluous callbacks advancing from srcu_gp_start()

Callbacks advancing on SRCU must be performed on two specific places:

1) On enqueue time in order to make room for the acceleration of the
new callback.

2) On invocation time in order to move the callbacks ready to invoke.

Any other callback advancing callsite is needless. Remove the remaining
one in srcu_gp_start().

Co-developed-by: Yong He <zhuangel570@gmail.com>
Signed-off-by: Yong He <zhuangel570@gmail.com>
Co-developed-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Co-developed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>


# 8a77f38b 03-Oct-2023 Frederic Weisbecker <frederic@kernel.org>

srcu: Only accelerate on enqueue time

Acceleration in SRCU happens on enqueue time for each new callback. This
operation is expected not to fail and therefore any similar attempt
from other places shouldn't find any remaining callbacks to accelerate.

Moreover accelerations performed beyond enqueue time are error prone
because rcu_seq_snap() then may return the snapshot for a new grace
period that is not going to be started.

Remove these dangerous and needless accelerations and introduce instead
assertions reporting leaking unaccelerated callbacks beyond enqueue
time.

Co-developed-by: Yong He <alexyonghe@tencent.com>
Signed-off-by: Yong He <alexyonghe@tencent.com>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Reviewed-by: Like Xu <likexu@tencent.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# 4a8e65b0 03-Oct-2023 Frederic Weisbecker <frederic@kernel.org>

srcu: Fix callbacks acceleration mishandling

SRCU callbacks acceleration might fail if the preceding callbacks
advance also fails. This can happen when the following steps are met:

1) The RCU_WAIT_TAIL segment has callbacks (say for gp_num 8) and the
RCU_NEXT_READY_TAIL also has callbacks (say for gp_num 12).

2) The grace period for RCU_WAIT_TAIL is observed as started but not yet
completed so rcu_seq_current() returns 4 + SRCU_STATE_SCAN1 = 5.

3) This value is passed to rcu_segcblist_advance() which can't move
any segment forward and fails.

4) srcu_gp_start_if_needed() still proceeds with callback acceleration.
But then the call to rcu_seq_snap() observes the grace period for the
RCU_WAIT_TAIL segment (gp_num 8) as completed and the subsequent one
for the RCU_NEXT_READY_TAIL segment as started
(ie: 8 + SRCU_STATE_SCAN1 = 9) so it returns a snapshot of the
next grace period, which is 16.

5) The value of 16 is passed to rcu_segcblist_accelerate() but the
freshly enqueued callback in RCU_NEXT_TAIL can't move to
RCU_NEXT_READY_TAIL which already has callbacks for a previous grace
period (gp_num = 12). So acceleration fails.

6) Note in all these steps, srcu_invoke_callbacks() hadn't had a chance
to run srcu_invoke_callbacks().

Then some very bad outcome may happen if the following happens:

7) Some other CPU races and starts the grace period number 16 before the
CPU handling previous steps had a chance. Therefore srcu_gp_start()
isn't called on the latter sdp to fix the acceleration leak from
previous steps with a new pair of call to advance/accelerate.

8) The grace period 16 completes and srcu_invoke_callbacks() is finally
called. All the callbacks from previous grace periods (8 and 12) are
correctly advanced and executed but callbacks in RCU_NEXT_READY_TAIL
still remain. Then rcu_segcblist_accelerate() is called with a
snaphot of 20.

9) Since nothing started the grace period number 20, callbacks stay
unhandled.

This has been reported in real load:

[3144162.608392] INFO: task kworker/136:12:252684 blocked for more
than 122 seconds.
[3144162.615986] Tainted: G O K 5.4.203-1-tlinux4-0011.1 #1
[3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[3144162.631162] kworker/136:12 D 0 252684 2 0x90004000
[3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
[3144162.631192] Call Trace:
[3144162.631202] __schedule+0x2ee/0x660
[3144162.631206] schedule+0x33/0xa0
[3144162.631209] schedule_timeout+0x1c4/0x340
[3144162.631214] ? update_load_avg+0x82/0x660
[3144162.631217] ? raw_spin_rq_lock_nested+0x1f/0x30
[3144162.631218] wait_for_completion+0x119/0x180
[3144162.631220] ? wake_up_q+0x80/0x80
[3144162.631224] __synchronize_srcu.part.19+0x81/0xb0
[3144162.631226] ? __bpf_trace_rcu_utilization+0x10/0x10
[3144162.631227] synchronize_srcu+0x5f/0xc0
[3144162.631236] irqfd_shutdown+0x3c/0xb0 [kvm]
[3144162.631239] ? __schedule+0x2f6/0x660
[3144162.631243] process_one_work+0x19a/0x3a0
[3144162.631244] worker_thread+0x37/0x3a0
[3144162.631247] kthread+0x117/0x140
[3144162.631247] ? process_one_work+0x3a0/0x3a0
[3144162.631248] ? __kthread_cancel_work+0x40/0x40
[3144162.631250] ret_from_fork+0x1f/0x30

Fix this with taking the snapshot for acceleration _before_ the read
of the current grace period number.

The only side effect of this solution is that callbacks advancing happen
then _after_ the full barrier in rcu_seq_snap(). This is not a problem
because that barrier only cares about:

1) Ordering accesses of the update side before call_srcu() so they don't
bleed.
2) See all the accesses prior to the grace period of the current gp_num

The only things callbacks advancing need to be ordered against are
carried by snp locking.

Reported-by: Yong He <alexyonghe@tencent.com>
Co-developed-by:: Yong He <alexyonghe@tencent.com>
Signed-off-by: Yong He <alexyonghe@tencent.com>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Link: http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com
Fixes: da915ad5cf25 ("srcu: Parallelize callback handling")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# d8d5b7bf 04-Sep-2023 Denis Arefev <arefev@swemel.ru>

srcu: Fix srcu_struct node grpmask overflow on 64-bit systems

The value of a bitwise expression 1 << (cpu - sdp->mynode->grplo)
is subject to overflow due to a failure to cast operands to a larger
data type before performing the bitwise operation.

The maximum result of this subtraction is defined by the RCU_FANOUT_LEAF
Kconfig option, which on 64-bit systems defaults to 16 (resulting in a
maximum shift of 15), but which can be set up as high as 64 (resulting
in a maximum shift of 63). A value of 31 can result in sign extension,
resulting in 0xffffffff80000000 instead of the desired 0x80000000.
A value of 32 or greater triggers undefined behavior per the C standard.

This bug has not been known to cause issues because almost all kernels
take the default CONFIG_RCU_FANOUT_LEAF=16. Furthermore, as long as a
given compiler gives a deterministic non-zero result for 1<<N for N>=32,
the code correctly invokes all SRCU callbacks, albeit wasting CPU time
along the way.

This commit therefore substitutes the correct 1UL for the buggy 1.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Denis Arefev <arefev@swemel.ru>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# 2cbc482d 04-Aug-2023 Zhen Lei <thunder.leizhen@huawei.com>

rcu: Dump memory object info if callback function is invalid

When a structure containing an RCU callback rhp is (incorrectly) freed
and reallocated after rhp is passed to call_rcu(), it is not unusual for
rhp->func to be set to NULL. This defeats the debugging prints used by
__call_rcu_common() in kernels built with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y,
which expect to identify the offending code using the identity of this
function.

And in kernels build without CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, things
are even worse, as can be seen from this splat:

Unable to handle kernel NULL pointer dereference at virtual address 0
... ...
PC is at 0x0
LR is at rcu_do_batch+0x1c0/0x3b8
... ...
(rcu_do_batch) from (rcu_core+0x1d4/0x284)
(rcu_core) from (__do_softirq+0x24c/0x344)
(__do_softirq) from (__irq_exit_rcu+0x64/0x108)
(__irq_exit_rcu) from (irq_exit+0x8/0x10)
(irq_exit) from (__handle_domain_irq+0x74/0x9c)
(__handle_domain_irq) from (gic_handle_irq+0x8c/0x98)
(gic_handle_irq) from (__irq_svc+0x5c/0x94)
(__irq_svc) from (arch_cpu_idle+0x20/0x3c)
(arch_cpu_idle) from (default_idle_call+0x4c/0x78)
(default_idle_call) from (do_idle+0xf8/0x150)
(do_idle) from (cpu_startup_entry+0x18/0x20)
(cpu_startup_entry) from (0xc01530)

This commit therefore adds calls to mem_dump_obj(rhp) to output some
information, for example:

slab kmalloc-256 start ffff410c45019900 pointer offset 0 size 256

This provides the rough size of the memory block and the offset of the
rcu_head structure, which as least provides at least a few clues to help
locate the problem. If the problem is reproducible, additional slab
debugging can be enabled, for example, CONFIG_DEBUG_SLAB=y, which can
provide significantly more information.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# f0a31b26 29-Jul-2023 Joel Fernandes (Google) <joel@joelfernandes.org>

srcu: Fix error handling in init_srcu_struct_fields()

The current error handling in init_srcu_struct_fields() is a bit
inconsistent. If init_srcu_struct_nodes() fails, the function either
returns -ENOMEM or 0 depending on whether ssp->sda_is_static is true or
false. This can make init_srcu_struct_fields() return 0 even if memory
allocation failed!

Simplify the error handling by always returning -ENOMEM if either
init_srcu_struct_nodes() or the per-CPU allocation fails. This makes the
control flow easier to follow and avoids the inconsistent return values.

Add goto labels to avoid duplicating the error cleanup code.

Link: https://lore.kernel.org/r/20230404003508.GA254019@google.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# 754aa642 27-Jan-2023 Joel Fernandes (Google) <joel@joelfernandes.org>

srcu: Clarify comments on memory barrier "E"

There is an smp_mb() named "E" in srcu_flip() immediately before the
increment (flip) of the srcu_struct structure's ->srcu_idx.

The purpose of E is to order the preceding scan's read of lock counters
against the flipping of the ->srcu_idx, in order to prevent new readers
from continuing to use the old ->srcu_idx value, which might needlessly
extend the grace period.

However, this ordering is already enforced because of the control
dependency between the preceding scan and the ->srcu_idx flip.
This control dependency exists because atomic_long_read() is used
to scan the counts, because WRITE_ONCE() is used to flip ->srcu_idx,
and because ->srcu_idx is not flipped until the ->srcu_lock_count[] and
->srcu_unlock_count[] counts match. And such a match cannot happen when
there is an in-flight reader that started before the flip (observation
courtesy Mathieu Desnoyers).

The litmus test below (courtesy of Frederic Weisbecker, with changes
for ctrldep by Boqun and Joel) shows this:

C srcu
(*
* bad condition: P0's first scan (SCAN1) saw P1's idx=0 LOCK count inc, though P1 saw flip.
*
* So basically, the ->po ordering on both P0 and P1 is enforced via ->ppo
* (control deps) on both sides, and both P0 and P1 are interconnected by ->rf
* relations. Combining the ->ppo with ->rf, a cycle is impossible.
*)

{}

// updater
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
int lock1;
int unlock1;
int lock0;
int unlock0;

// SCAN1
unlock1 = READ_ONCE(*UNLOCK1);
smp_mb(); // A
lock1 = READ_ONCE(*LOCK1);

// FLIP
if (lock1 == unlock1) { // Control dep
smp_mb(); // E // Remove E and still passes.
WRITE_ONCE(*IDX, 1);
smp_mb(); // D

// SCAN2
unlock0 = READ_ONCE(*UNLOCK0);
smp_mb(); // A
lock0 = READ_ONCE(*LOCK0);
}
}

// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
int tmp;
int idx1;
int idx2;

// 1st reader
idx1 = READ_ONCE(*IDX);
if (idx1 == 0) { // Control dep
tmp = READ_ONCE(*LOCK0);
WRITE_ONCE(*LOCK0, tmp + 1);
smp_mb(); /* B and C */
tmp = READ_ONCE(*UNLOCK0);
WRITE_ONCE(*UNLOCK0, tmp + 1);
} else {
tmp = READ_ONCE(*LOCK1);
WRITE_ONCE(*LOCK1, tmp + 1);
smp_mb(); /* B and C */
tmp = READ_ONCE(*UNLOCK1);
WRITE_ONCE(*UNLOCK1, tmp + 1);
}
}

exists (0:lock1=1 /\ 1:idx1=1)

More complicated litmus tests with multiple SRCU readers also show that
memory barrier E is not needed.

This commit therefore clarifies the comment on memory barrier E.

Why not also remove that redundant smp_mb()?

Because control dependencies are quite fragile due to their not being
recognized by most compilers and tools. Control dependencies therefore
exact an ongoing maintenance burden, and such a burden cannot be justified
in this slowpath. Therefore, that smp_mb() stays until such time as
its overhead becomes a measurable problem in a real workload running on
a real production system, or until such time as compilers start paying
attention to this sort of control dependency.

Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Co-developed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>


# cefc0a59 18-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Fix long lines in srcu_funnel_gp_start()

This commit creates an srcu_usage pointer named "sup" as a shorter
synonym for the "ssp->srcu_sup" that was bloating several lines of code.

Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 6c366522 18-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Fix long lines in srcu_gp_end()

This commit creates an srcu_usage pointer named "sup" as a shorter
synonym for the "ssp->srcu_sup" that was bloating several lines of code.

Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 5ff8319f 18-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Fix long lines in cleanup_srcu_struct()

This commit creates an srcu_usage pointer named "sup" as a shorter
synonym for the "ssp->srcu_sup" that was bloating several lines of code.

Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# eabe7625 18-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Fix long lines in srcu_get_delay()

This commit creates an srcu_usage pointer named "sup" as a shorter
synonym for the "ssp->srcu_sup" that was bloating several lines of code.

Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a7bf4d7c 24-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Check for readers at module-exit time

If a given statically allocated in-module srcu_struct structure was ever
used for updates, srcu_module_going() will invoke cleanup_srcu_struct()
at module-exit time. This will check for the error case of SRCU readers
persisting past module-exit time. On the other hand, if this srcu_struct
structure never went through a grace period, srcu_module_going() only
invokes free_percpu(), which would result in strange failures if SRCU
readers persisted past module-exit time.

This commit therefore adds a srcu_readers_active() check to
srcu_module_going(), splatting if readers have persisted and refraining
from invoking free_percpu() in that case. Better to leak memory than
to suffer silent memory corruption!

[ paulmck: Apply Zhang, Qiang1 feedback on memory leak. ]

Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# fd1b3f8e 17-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Move work-scheduling fields from srcu_struct to srcu_usage

This commit moves the ->reschedule_jiffies, ->reschedule_count, and
->work fields from the srcu_struct structure to the srcu_usage structure
to reduce the size of the former in order to improve cache locality.

However, this means that the container_of() calls cannot get a pointer
to the srcu_struct because they are no longer in the srcu_struct.
This issue is addressed by adding a ->srcu_ssp field in the srcu_usage
structure that references the corresponding srcu_struct structure.
And given the presence of the sup pointer to the srcu_usage structure,
replace some ssp->srcu_usage-> instances with sup->.

[ paulmck Apply feedback from kernel test robot. ]

Link: https://lore.kernel.org/oe-kbuild-all/202303191400.iO5BOqka-lkp@intel.com/
Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# d20162e0 17-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Move srcu_barrier() fields from srcu_struct to srcu_usage

This commit moves the ->srcu_barrier_seq, ->srcu_barrier_mutex,
->srcu_barrier_completion, and ->srcu_barrier_cpu_cnt fields from the
srcu_struct structure to the srcu_usage structure to reduce the size of
the former in order to improve cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 660349ac 17-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Move ->sda_is_static from srcu_struct to srcu_usage

This commit moves the ->sda_is_static field from the srcu_struct structure
to the srcu_usage structure to reduce the size of the former in order
to improve cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 3b46679c 17-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Move heuristics fields from srcu_struct to srcu_usage

This commit moves the ->srcu_size_jiffies, ->srcu_n_lock_retries,
and ->srcu_n_exp_nodelay fields from the srcu_struct structure to the
srcu_usage structure to reduce the size of the former in order to improve
cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 03200b5c 17-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Move grace-period fields from srcu_struct to srcu_usage

This commit moves the ->srcu_gp_seq, ->srcu_gp_seq_needed,
->srcu_gp_seq_needed_exp, ->srcu_gp_start, and ->srcu_last_gp_end fields
from the srcu_struct structure to the srcu_usage structure to reduce
the size of the former in order to improve cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# e3a6ab25 17-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Move ->srcu_gp_mutex from srcu_struct to srcu_usage

This commit moves the ->srcu_gp_mutex field from the srcu_struct structure
to the srcu_usage structure to reduce the size of the former in order
to improve cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# b3fb11f7 17-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Move ->lock from srcu_struct to srcu_usage

This commit moves the ->lock field from the srcu_struct structure to
the srcu_usage structure to reduce the size of the former in order to
improve cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 0839ade9 17-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Move ->lock initialization after srcu_usage allocation

Currently, both __init_srcu_struct() in CONFIG_DEBUG_LOCK_ALLOC=y kernels
and init_srcu_struct() in CONFIG_DEBUG_LOCK_ALLOC=n kernel initialize
the srcu_struct structure's ->lock before the srcu_usage structure has
been allocated. This of course prevents the ->lock from being moved
to the srcu_usage structure, so this commit moves the initialization
into the init_srcu_struct_fields() after the srcu_usage structure has
been allocated.

Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 574dc1a7 17-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Move ->srcu_cb_mutex from srcu_struct to srcu_usage

This commit moves the ->srcu_cb_mutex field from the srcu_struct structure
to the srcu_usage structure to reduce the size of the former in order
to improve cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a0d8cbd3 17-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Move ->srcu_size_state from srcu_struct to srcu_usage

This commit moves the ->srcu_size_state field from the srcu_struct
structure to the srcu_usage structure to reduce the size of the former
in order to improve cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 208f41b1 17-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Move ->level from srcu_struct to srcu_usage

This commit moves the ->level[] array from the srcu_struct structure to
the srcu_usage structure to reduce the size of the former in order to
improve cache locality.

Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 95433f72 16-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Begin offloading srcu_struct fields to srcu_update

The current srcu_struct structure is on the order of 200 bytes in size
(depending on architecture and .config), which is much better than the
old-style 26K bytes, but still all too inconvenient when one is trying
to achieve good cache locality on a fastpath involving SRCU readers.

However, only a few fields in srcu_struct are used by SRCU readers.
The remaining fields could be offloaded to a new srcu_update
structure, thus shrinking the srcu_struct structure down to a few
tens of bytes. This commit begins this noble quest, a quest that is
complicated by open-coded initialization of the srcu_struct within the
srcu_notifier_head structure. This complication is addressed by updating
the srcu_notifier_head structure's open coding, given that there does
not appear to be a straightforward way of abstracting that initialization.

This commit moves only the ->node pointer to srcu_update. Later commits
will move additional fields.

[ paulmck: Fold in qiang1.zhang@intel.com's memory-leak fix. ]

Link: https://lore.kernel.org/all/20230320055751.4120251-1-qiang1.zhang@intel.com/
Suggested-by: Christoph Hellwig <hch@lst.de>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# f4d01a25 17-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

srcu: Use static init for statically allocated in-module srcu_struct

Further shrinking the srcu_struct structure is eased by requiring
that in-module srcu_struct structures rely more heavily on static
initialization. In particular, this preserves the property that
a module-load-time srcu_struct initialization can fail only due
to memory-allocation failure of the per-CPU srcu_data structures.
It might also slightly improve robustness by keeping the number of memory
allocations that must succeed down percpu_alloc() call.

This is in preparation for splitting an srcu_usage structure out
of the srcu_struct structure.

[ paulmck: Fold in qiang1.zhang@intel.com feedback. ]

Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# f0f44752 12-Jan-2023 Boqun Feng <boqun.feng@gmail.com>

rcu: Annotate SRCU's update-side lockdep dependencies

Although all flavors of RCU readers are annotated correctly with
lockdep as recursive read locks, they do not set the lock_acquire
'check' parameter. This means that RCU read locks are not added to
the lockdep dependency graph, which in turn means that lockdep cannot
detect RCU-based deadlocks. This is not a problem for RCU flavors having
atomic read-side critical sections because context-based annotations can
catch these deadlocks, see for example the RCU_LOCKDEP_WARN() statement
in synchronize_rcu(). But context-based annotations are not helpful
for sleepable RCU, especially given that it is perfectly legal to do
synchronize_srcu(&srcu1) within an srcu_read_lock(&srcu2).

However, we can detect SRCU-based by: (1) Making srcu_read_lock() a
'check'ed recursive read lock and (2) Making synchronize_srcu() a empty
write lock critical section. Even better, with the newly introduced
lock_sync(), we can avoid false positives about irq-unsafe/safe.
This commit therefore makes it so.

Note that NMI-safe SRCU read side critical sections are currently not
annotated, but might be annotated in the future.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ boqun: Add comments for annotation per Waiman's suggestion ]
[ boqun: Fix comment warning reported by Stephen Rothwell ]
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>


# dafc4d16 21-Dec-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Update comment after the index flip

Because there is not guaranteed to be a full memory barrier between
the ->srcu_unlock_count increment of an srcu_read_unlock() and the
->srcu_lock_count increment of the next srcu_read_lock(), this next
srcu_read_lock() is not guaranteed to see the effect of the index flip
just prior to this comment. However, this next srcu_read_lock() will
execute a full memory barrier, so the srcu_read_lock() after that is
guaranteed to see that index flip.

This guarantee is illustrated by the following diagram of events and
the litmus test following that.

------------------------------------------------------------------------

READER UPDATER
------------- ----------
// idx is initially 0.

srcu_flip() {
smp_mb();
// RSCS

srcu_read_unlock() {
smp_mb();
idx++; // P
smp_mb(); // QQ
}

srcu_readers_unlock_idx(0) {
,--counted------------ count all unlock[0]; // Q
|
unlock[0]++; // X

}
smp_mb();
srcu_read_lock() {
READ(idx) = 0; ,---- count all lock[0]; // contributes imbalance of 1.
lock[0]++; ----counted |
smp_mb(); // PP } |
} |
|
// RSCS not going to effect above scan
|
srcu_read_unlock() { |
smp_mb(); |
unlock[0]++; |
} |
/
/
srcu_read_lock() { |
READ(idx); // Y -----cannot be counted because of P (has to sample idx as 1)
lock[1]++;
...
}

------------------------------------------------------------------------

This makes it similar to the store buffer pattern. Using X, Y, P and Q
annotated above, we get:

------------------------------------------------------------------------

READER UPDATER
X (write) P (write)

smp_mb(); //PP smp_mb(); //QQ

Y (read) Q (read)

------------------------------------------------------------------------

ASCII art courtesy of Joel Fernandes.

Reported-by: Joel Fernandes <joel@joelfernandes.org>
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 0cd4b50b 14-Dec-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Yet more detail for srcu_readers_active_idx_check() comments

The comment in srcu_readers_active_idx_check() following the smp_mb()
is out of date, hailing from a simpler time when preemption was disabled
across the bulk of __srcu_read_lock(). The fact that preemption was
disabled meant that the number of tasks that had fetched the old index
but not yet incremented counters was limited by the number of CPUs.

In our more complex modern times, the number of CPUs is no longer a limit.
This commit therefore updates this comment, additionally giving more
memory-ordering detail.

[ paulmck: Apply Nt->Nc feedback from Joel Fernandes. ]

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Reported-by: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Reported-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 1bafbfb3 23-Nov-2022 Pingfan Liu <kernelfans@gmail.com>

srcu: Remove needless rcu_seq_done() check while holding read lock

The srcu_gp_start_if_needed() function now read-holds the srcu_struct
whose grace period is being started, which means that the corresponding
SRCU grace period cannot end. This in turn means that the SRCU
grace-period sequence number returned by rcu_seq_snap() cannot expire
during this time. And that means that the calls to rcu_seq_done() in
srcu_funnel_exp_start() and srcu_funnel_gp_start() can never return true.

This commit therefore removes these rcu_seq_done() checks, but adds checks
in kernels built with CONFIG_PROVE_RCU=y that splats if rcu_seq_done()
does somehow return true.

[ paulmck: Rearrange checks to handle kernels built with lockdep. ]

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: rcu@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 50be0c04 15-Nov-2022 Pingfan Liu <kernelfans@gmail.com>

srcu: Fix the comparision in srcu_invl_snp_seq()

A grace-period sequence number contains two fields: counter and
state. SRCU_SNP_INIT_SEQ provides a guaranteed invalid value for
grace-period sequence numbers in newly allocated srcu_node structures'
->srcu_have_cbs[] and ->srcu_gp_seq_needed_exp fields. The point of the
comparison in srcu_invl_snp_seq() is not to detect invalid grace-period
sequence numbers in general, but rather to detect a newly allocated
srcu_node structure whose ->srcu_have_cbs[] and ->srcu_gp_seq_needed_exp
fields need to be brought into line with the srcu_struct structure's
->srcu_gp_seq field.

This commit therefore causes srcu_invl_snp_seq() to compare both fields
of the specified grace-period sequence number.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: <rcu@vger.kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 7f24626d 30-Oct-2022 Pingfan Liu <kernelfans@gmail.com>

srcu: Delegate work to the boot cpu if using SRCU_SIZE_SMALL

Commit 994f706872e6 ("srcu: Make Tree SRCU able to operate without
snp_node array") assumes that cpu 0 is always online. However, there
really are situations when some other CPU is the boot CPU, for example,
when booting a kdump kernel with the maxcpus=1 boot parameter.

On PowerPC, the kdump kernel can hang as follows:
...
[ 1.740036] systemd[1]: Hostname set to <xyz.com>
[ 243.686240] INFO: task systemd:1 blocked for more than 122 seconds.
[ 243.686264] Not tainted 6.1.0-rc1 #1
[ 243.686272] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 243.686281] task:systemd state:D stack:0 pid:1 ppid:0 flags:0x00042000
[ 243.686296] Call Trace:
[ 243.686301] [c000000016657640] [c000000016657670] 0xc000000016657670 (unreliable)
[ 243.686317] [c000000016657830] [c00000001001dec0] __switch_to+0x130/0x220
[ 243.686333] [c000000016657890] [c000000010f607b8] __schedule+0x1f8/0x580
[ 243.686347] [c000000016657940] [c000000010f60bb4] schedule+0x74/0x140
[ 243.686361] [c0000000166579b0] [c000000010f699b8] schedule_timeout+0x168/0x1c0
[ 243.686374] [c000000016657a80] [c000000010f61de8] __wait_for_common+0x148/0x360
[ 243.686387] [c000000016657b20] [c000000010176bb0] __flush_work.isra.0+0x1c0/0x3d0
[ 243.686401] [c000000016657bb0] [c0000000105f2768] fsnotify_wait_marks_destroyed+0x28/0x40
[ 243.686415] [c000000016657bd0] [c0000000105f21b8] fsnotify_destroy_group+0x68/0x160
[ 243.686428] [c000000016657c40] [c0000000105f6500] inotify_release+0x30/0xa0
[ 243.686440] [c000000016657cb0] [c0000000105751a8] __fput+0xc8/0x350
[ 243.686452] [c000000016657d00] [c00000001017d524] task_work_run+0xe4/0x170
[ 243.686464] [c000000016657d50] [c000000010020e94] do_notify_resume+0x134/0x140
[ 243.686478] [c000000016657d80] [c00000001002eb18] interrupt_exit_user_prepare_main+0x198/0x270
[ 243.686493] [c000000016657de0] [c00000001002ec60] syscall_exit_prepare+0x70/0x180
[ 243.686505] [c000000016657e10] [c00000001000bf7c] system_call_vectored_common+0xfc/0x280
[ 243.686520] --- interrupt: 3000 at 0x7fffa47d5ba4
[ 243.686528] NIP: 00007fffa47d5ba4 LR: 0000000000000000 CTR: 0000000000000000
[ 243.686538] REGS: c000000016657e80 TRAP: 3000 Not tainted (6.1.0-rc1)
[ 243.686548] MSR: 800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE> CR: 42044440 XER: 00000000
[ 243.686572] IRQMASK: 0
[ 243.686572] GPR00: 0000000000000006 00007ffffa606710 00007fffa48e7200 0000000000000000
[ 243.686572] GPR04: 0000000000000002 000000000000000a 0000000000000000 0000000000000001
[ 243.686572] GPR08: 000001000c172dd0 0000000000000000 0000000000000000 0000000000000000
[ 243.686572] GPR12: 0000000000000000 00007fffa4ff4bc0 0000000000000000 0000000000000000
[ 243.686572] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 243.686572] GPR20: 0000000132dfdc50 000000000000000e 0000000000189375 0000000000000000
[ 243.686572] GPR24: 00007ffffa606ae0 0000000000000005 000001000c185490 000001000c172570
[ 243.686572] GPR28: 000001000c172990 000001000c184850 000001000c172e00 00007fffa4fedd98
[ 243.686683] NIP [00007fffa47d5ba4] 0x7fffa47d5ba4
[ 243.686691] LR [0000000000000000] 0x0
[ 243.686698] --- interrupt: 3000
[ 243.686708] INFO: task kworker/u16:1:24 blocked for more than 122 seconds.
[ 243.686717] Not tainted 6.1.0-rc1 #1
[ 243.686724] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 243.686733] task:kworker/u16:1 state:D stack:0 pid:24 ppid:2 flags:0x00000800
[ 243.686747] Workqueue: events_unbound fsnotify_mark_destroy_workfn
[ 243.686758] Call Trace:
[ 243.686762] [c0000000166736e0] [c00000004fd91000] 0xc00000004fd91000 (unreliable)
[ 243.686775] [c0000000166738d0] [c00000001001dec0] __switch_to+0x130/0x220
[ 243.686788] [c000000016673930] [c000000010f607b8] __schedule+0x1f8/0x580
[ 243.686801] [c0000000166739e0] [c000000010f60bb4] schedule+0x74/0x140
[ 243.686814] [c000000016673a50] [c000000010f699b8] schedule_timeout+0x168/0x1c0
[ 243.686827] [c000000016673b20] [c000000010f61de8] __wait_for_common+0x148/0x360
[ 243.686840] [c000000016673bc0] [c000000010210840] __synchronize_srcu.part.0+0xa0/0xe0
[ 243.686855] [c000000016673c30] [c0000000105f2c64] fsnotify_mark_destroy_workfn+0xc4/0x1a0
[ 243.686868] [c000000016673ca0] [c000000010174ea8] process_one_work+0x2a8/0x570
[ 243.686882] [c000000016673d40] [c000000010175208] worker_thread+0x98/0x5e0
[ 243.686895] [c000000016673dc0] [c0000000101828d4] kthread+0x124/0x130
[ 243.686908] [c000000016673e10] [c00000001000cd40] ret_from_kernel_thread+0x5c/0x64
[ 366.566274] INFO: task systemd:1 blocked for more than 245 seconds.
[ 366.566298] Not tainted 6.1.0-rc1 #1
[ 366.566305] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 366.566314] task:systemd state:D stack:0 pid:1 ppid:0 flags:0x00042000
[ 366.566329] Call Trace:
...

The above splat occurs because PowerPC really does use maxcpus=1
instead of nr_cpus=1 in the kernel command line. Consequently, the
(quite possibly non-zero) kdump CPU is the only online CPU in the kdump
kernel. SRCU unconditionally queues a sdp->work on cpu 0, for which no
worker thread has been created, so sdp->work will be never executed and
__synchronize_srcu() will never be completed.

This commit therefore replaces CPU ID 0 with get_boot_cpu_id() in key
places in Tree SRCU. Since the CPU indicated by get_boot_cpu_id()
is guaranteed to be online, this avoids the above splat.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: rcu@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# e29a4915 13-Oct-2022 Frederic Weisbecker <frederic@kernel.org>

srcu: Debug NMI safety even on archs that don't require it

Currently the NMI safety debugging is only performed on architectures
that don't support NMI-safe this_cpu_inc().

Reorder the code so that other architectures like x86 also detect bad
uses.

[ paulmck: Apply kernel test robot, Stephen Rothwell, and Zqiang feedback. ]

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# ae3c0706 13-Oct-2022 Frederic Weisbecker <frederic@kernel.org>

srcu: Explain the reason behind the read side critical section on GP start

Tell about the need to protect against concurrent updaters who may
overflow the GP counter behind the current update.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 6b77bb9b 13-Oct-2022 Frederic Weisbecker <frederic@kernel.org>

srcu: Warn when NMI-unsafe API is used in NMI

Using the NMI-unsafe reader API from within an NMI handler is very likely
to be buggy for three reasons:

1) NMIs aren't strictly re-entrant (a pending nested NMI will execute at
the end of the current one) so it should be fine to use a non-atomic
increment here. However, breakpoints can still interrupt NMIs and if
a breakpoint callback has a reader on that same ssp, a racy increment
can happen.

2) If the only reader site for a given srcu_struct structure is in an
NMI handler, then RCU should be used instead of SRCU.

3) Because of the previous reason (2), an srcu_struct structure having
an SRCU read side critical section in an NMI handler is likely to
have another one from a task context.

For all these reasons, warn if an NMI-unsafe reader API is used from an
NMI handler.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 36f65f1d 20-Sep-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Check for consistent global per-srcu_struct NMI safety

This commit adds runtime checks to verify that a given srcu_struct uses
consistent NMI-safe (or not) read-side primitives globally, but based
on the per-CPU data. These global checks are made by the grace-period
code that must scan the srcu_data structures anyway, and are done only
in kernels built with CONFIG_PROVE_RCU=y.

Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Petr Mladek <pmladek@suse.com>


# 27120e7d 19-Sep-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Check for consistent per-CPU per-srcu_struct NMI safety

This commit adds runtime checks to verify that a given srcu_struct uses
consistent NMI-safe (or not) read-side primitives on a per-CPU basis.

Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Petr Mladek <pmladek@suse.com>


# 2e83b879 15-Sep-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Create an srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe()

On strict load-store architectures, the use of this_cpu_inc() by
srcu_read_lock() and srcu_read_unlock() is not NMI-safe in TREE SRCU.
To see this suppose that an NMI arrives in the middle of srcu_read_lock(),
just after it has read ->srcu_lock_count, but before it has written
the incremented value back to memory. If that NMI handler also does
srcu_read_lock() and srcu_read_lock() on that same srcu_struct structure,
then upon return from that NMI handler, the interrupted srcu_read_lock()
will overwrite the NMI handler's update to ->srcu_lock_count, but
leave unchanged the NMI handler's update by srcu_read_unlock() to
->srcu_unlock_count.

This can result in a too-short SRCU grace period, which can in turn
result in arbitrary memory corruption.

If the NMI handler instead interrupts the srcu_read_unlock(), this
can result in eternal SRCU grace periods, which is not much better.

This commit therefore creates a pair of new srcu_read_lock_nmisafe()
and srcu_read_unlock_nmisafe() functions, which allow SRCU readers in
both NMI handlers and in process and IRQ context. It is bad practice
to mix the existing and the new _nmisafe() primitives on the same
srcu_struct structure. Use one set or the other, not both.

Just to underline that "bad practice" point, using srcu_read_lock() at
process level and srcu_read_lock_nmisafe() in your NMI handler will not,
repeat NOT, work. If you do not immediately understand why this is the
case, please review the earlier paragraphs in this commit log.

[ paulmck: Apply kernel test robot feedback. ]
[ paulmck: Apply feedback from Randy Dunlap. ]
[ paulmck: Apply feedback from John Ogness. ]
[ paulmck: Apply feedback from Frederic Weisbecker. ]

Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Petr Mladek <pmladek@suse.com>


# 5d0f5953 15-Sep-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Convert ->srcu_lock_count and ->srcu_unlock_count to atomic

NMI-safe variants of srcu_read_lock() and srcu_read_unlock() are needed
by printk(), which on many architectures entails read-modify-write
atomic operations. This commit prepares Tree SRCU for this change by
making both ->srcu_lock_count and ->srcu_unlock_count by atomic_long_t.

[ paulmck: Apply feedback from John Ogness. ]

Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Petr Mladek <pmladek@suse.com>


# 4f2bfd94 30-Jun-2022 Neeraj Upadhyay <quic_neeraju@quicinc.com>

srcu: Make expedited RCU grace periods block even less frequently

The purpose of commit 282d8998e997 ("srcu: Prevent expedited GPs
and blocking readers from consuming CPU") was to prevent a long
series of never-blocking expedited SRCU grace periods from blocking
kernel-live-patching (KLP) progress. Although it was successful, it also
resulted in excessive boot times on certain embedded workloads running
under qemu with the "-bios QEMU_EFI.fd" command line. Here "excessive"
means increasing the boot time up into the three-to-four minute range.
This increase in boot time was due to the more than 6000 back-to-back
invocations of synchronize_rcu_expedited() within the KVM host OS, which
in turn resulted from qemu's emulation of a long series of MMIO accesses.

Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace
periods") did not significantly help this particular use case.

Zhangfei Gao and Shameerali Kolothum Thodi did experiments varying the
value of SRCU_MAX_NODELAY_PHASE with HZ=250 and with various values
of non-sleeping per phase counts on a system with preemption enabled,
and observed the following boot times:

+──────────────────────────+────────────────+
| SRCU_MAX_NODELAY_PHASE | Boot time (s) |
+──────────────────────────+────────────────+
| 100 | 30.053 |
| 150 | 25.151 |
| 200 | 20.704 |
| 250 | 15.748 |
| 500 | 11.401 |
| 1000 | 11.443 |
| 10000 | 11.258 |
| 1000000 | 11.154 |
+──────────────────────────+────────────────+

Analysis on the experiment results show additional improvements with
CPU-bound delays approaching one jiffy in duration. This improvement was
also seen when number of per-phase iterations were scaled to one jiffy.

This commit therefore scales per-grace-period phase number of non-sleeping
polls so that non-sleeping polls extend for about one jiffy. In addition,
the delay-calculation call to srcu_get_delay() in srcu_gp_end() is
replaced with a simple check for an expedited grace period. This change
schedules callback invocation immediately after expedited grace periods
complete, which results in greatly improved boot times. Testing done
by Marc and Zhangfei confirms that this change recovers most of the
performance degradation in boottime; for CONFIG_HZ_250 configuration,
specifically, boot times improve from 3m50s to 41s on Marc's setup;
and from 2m40s to ~9.7s on Zhangfei's setup.

In addition to the changes to default per phase delays, this
change adds 3 new kernel parameters - srcutree.srcu_max_nodelay,
srcutree.srcu_max_nodelay_phase, and srcutree.srcu_retry_check_delay.
This allows users to configure the srcu grace period scanning delays in
order to more quickly react to additional use cases.

Fixes: 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods")
Fixes: 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU")
Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reported-by: yueluck <yueluck@163.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Tested-by: Marc Zyngier <maz@kernel.org>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 8f870e6e 12-Jun-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Block less aggressively for expedited grace periods

Commit 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers
from consuming CPU") fixed a problem where a long-running expedited SRCU
grace period could block kernel live patching. It did so by giving up
on expediting once a given SRCU expedited grace period grew too old.

Unfortunately, this added excessive delays to boots of virtual embedded
systems specifying "-bios QEMU_EFI.fd" to qemu. This commit therefore
makes the transition away from expediting less aggressive, increasing
the per-grace-period phase number of non-sleeping polls of readers from
one to three and increasing the required grace-period age from one jiffy
(actually from zero to one jiffies) to two jiffies (actually from one
to two jiffies).

Fixes: 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reported-by: chenxiang (M)" <chenxiang66@hisilicon.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/


# 586e31d5 15-Mar-2022 Lukas Bulwahn <lukas.bulwahn@gmail.com>

srcu: Drop needless initialization of sdp in srcu_gp_start()

Commit 9c7ef4c30f12 ("srcu: Make Tree SRCU able to operate without
snp_node array") initializes the local variable sdp differently depending
on the srcu's state in srcu_gp_start(). Either way, this initialization
overwrites the value used when sdp is defined.

This commit therefore drops this pointless definition-time initialization.
Although there is no functional change, compiler code generation may
be affected.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 282d8998 08-Mar-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Prevent expedited GPs and blocking readers from consuming CPU

If an SRCU reader blocks while a synchronize_srcu_expedited() waits for
that same reader, then that grace period will spawn an endless series of
workqueue handlers, consuming a full CPU. This quickly gets pointless
because consuming more CPU isn't going to make that reader get done
faster, especially if it is blocked waiting for an external event.

This commit therefore spawns at most one pair of back-to-back workqueue
handlers per expedited grace period phase, instead inserting increasing
delays as that grace period phase grows older, but capped at 10 jiffies.
In any case, if there have been at least 100 back-to-back workqueue
handlers within a single jiffy, regardless of grace period or grace-period
phase, then a one-jiffy delay is inserted.

[ paulmck: Apply feedback from kernel test robot. ]

Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reported-by: Song Liu <song@kernel.org>
Tested-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# c2445d38 31-Jan-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Add contention check to call_srcu() srcu_data ->lock acquisition

This commit increases the sensitivity of contention detection by adding
checks to the acquisition of the srcu_data structure's lock on the
call_srcu() code path.

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a57ffb3c 31-Jan-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Automatically determine size-transition strategy at boot

This commit adds a srcutree.convert_to_big option of zero that causes
SRCU to decide at boot whether to wait for contention (small systems) or
immediately expand to large (large systems). A new srcutree.big_cpu_lim
(defaulting to 128) defines how many CPUs constitute a large system.

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 9f2e91d9 27-Jan-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Add contention-triggered addition of srcu_node tree

This commit instruments the acquisitions of the srcu_struct structure's
->lock, enabling the initiation of a transition from SRCU_SIZE_SMALL
to SRCU_SIZE_BIG when sufficient contention is experienced. The
instrumentation counts the number of trylock failures within the confines
of a single jiffy. If that number exceeds the value specified by the
srcutree.small_contention_lim kernel boot parameter (which defaults to
100), and if the value specified by the srcutree.convert_to_big kernel
boot parameter has the 0x10 bit set (defaults to 0), then a transition
will be automatically initiated.

By default, there will never be any transitions, so that none of the
srcu_struct structures ever gains an srcu_node array.

The useful values for srcutree.convert_to_big are:

0x00: Never convert.
0x01: Always convert at init_srcu_struct() time.
0x02: Convert when rcutorture prints its first round of statistics.
0x03: Decide conversion approach at boot given system size.
0x10: Convert if contention is encountered.
0x12: Convert if contention is encountered or when rcutorture prints
its first round of statistics, whichever comes first.

The value 0x11 acts the same as 0x01 because the conversion happens
before there is any chance of contention.

[ paulmck: Apply "static" feedback from kernel test robot. ]

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 99659f64 27-Jan-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Create concurrency-safe helper for initiating size transition

Once there are contention-initiated size transitions, it will be
possible for rcutorture to initiate a transition at the same time
as a contention-initiated transition. This commit therefore creates
a concurrency-safe helper function named srcu_transition_to_big() to
safely initiate size transitions.

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# ee5e2448 27-Jan-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Explain srcu_funnel_gp_start() call to list_add() is safe

This commit adds a comment explaining why an unprotected call to
list_add() from srcu_funnel_gp_start() can be safe. TL;DR: It is only
called during very early boot when we don't have no steeking concurrency!

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 46470cf8 27-Jan-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Prevent cleanup_srcu_struct() from freeing non-dynamic ->sda

When an srcu_struct structure is created (but not in a kernel module)
by DEFINE_SRCU() and friends, the per-CPU srcu_data structure is
statically allocated. In all other cases, that structure is obtained
from alloc_percpu(), in which case cleanup_srcu_struct() must invoke
free_percpu() on the resulting ->sda pointer in the srcu_struct pointer.

Which it does.

Except that it also invokes free_percpu() on the ->sda pointer
referencing the statically allocated per-CPU srcu_data structures.
Which free_percpu() is surprisingly OK with.

This commit nevertheless stops cleanup_srcu_struct() from freeing
statically allocated per-CPU srcu_data structures.

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 4a230f80 27-Jan-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Avoid NULL dereference in srcu_torture_stats_print()

You really shouldn't invoke srcu_torture_stats_print() after invoking
cleanup_srcu_struct(), but there is really no reason to get a
compiler-obfuscated per-CPU-variable NULL pointer dereference as the
diagnostic. This commit therefore checks for NULL ->sda and makes a
more polite console-message complaint in that case.

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# c69a00a1 25-Jan-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Add boot-time control over srcu_node array allocation

This commit adds an srcu_tree.convert_to_big kernel parameter that either
refuses to convert at all (0), converts immediately at init_srcu_struct()
time (1), or lets rcutorture convert it (2). An addition contention-based
dynamic conversion choice will be added, along with documentation.

[ paulmck: Apply callback-scanning feedback from Neeraj Upadhyay. ]

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 0b56f953 21-Feb-2022 Neeraj Upadhyay <quic_neeraju@quicinc.com>

srcu: Ensure snp nodes tree is fully initialized before traversal

For configurations where snp node tree is not initialized at
init time (added in subsequent commits), srcu_funnel_gp_start()
and srcu_funnel_exp_start() can potential traverse and observe
the snp nodes' transient (uninitialized) states. This can potentially
happen, when init_srcu_struct_nodes() initialization of sdp->mynode
races with srcu_funnel_gp_start() and srcu_funnel_exp_start()

Consider the case below where srcu_funnel_gp_start() observes
sdp->mynode to be not NULL and uses an uninitialized sdp->grpmask

P1 P2

init_srcu_struct_nodes() void srcu_funnel_gp_start(...)
{
for_each_possible_cpu(cpu) {
...
sdp->mynode = &snp_first[...];
for (snp = sdp->mynode;...) struct srcu_node *snp_leaf =
smp_load_acquire(&sdp->mynode)
... if (snp_leaf) {
for (snp = snp_leaf; ...)
...
if (snp == snp_leaf)
snp->srcu_data_have_cbs[idx] |=
sdp->grpmask;
sdp->grpmask =
1 << (cpu - sdp->mynode->grplo);
}
}

Similarly, init_srcu_struct_nodes() and srcu_funnel_exp_start() can
race, where srcu_funnel_exp_start() could observe state of snp lock
before spin_lock_init().

P1 P2

init_srcu_struct_nodes() void srcu_funnel_exp_start(...)
{
srcu_for_each_node_breadth_first(ssp, snp) { for (; ...) {
spin_lock_...(snp, )
spin_lock_init(&ACCESS_PRIVATE(snp, lock));
...
}
for_each_possible_cpu(cpu) {
...
sdp->mynode = &snp_first[...];

To avoid these issues, ensure that snp node tree initialization is
complete i.e. after SRCU_SIZE_WAIT_BARRIER srcu_size_state is reached,
before traversing the tree. Given that srcu_funnel_gp_start() and
srcu_funnel_exp_start() are called within SRCU read side critical
sections, this check is safe, in the sense that all callbacks are
enqueued on CPU0 srcu_cblist until SRCU_SIZE_WAIT_CALL is entered,
and these read side critical sections (containing srcu_funnel_gp_start()
and srcu_funnel_exp_start()) need to complete, before SRCU_SIZE_WAIT_CALL
is reached.

Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# cbdc98e9 26-Jan-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Use invalid initial value for srcu_node GP sequence numbers

Currently, tree SRCU relies on the srcu_node structures being initialized
at the same time that the srcu_struct itself is initialized, and thus
use the initial grace-period sequence number as the initial value for
the srcu_node structure's ->srcu_have_cbs[] and ->srcu_gp_seq_needed_exp
fields. Although this has a high probability of also working when the
srcu_node array is allocated and initialized at some random later time,
it would be better to avoid leaving such things to chance.

This commit therefore initializes these fields with 0x2, which is a
recognizable invalid value. It then adds the required checks for this
invalid value in order to avoid confusion on long-running kernels
(especially those on 32-bit systems) that allocate and initialize
srcu_node arrays late in life.

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# aeb9b39b 26-Jan-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Compute snp_seq earlier in srcu_funnel_gp_start()

Currently, srcu_funnel_gp_start() tests snp->srcu_have_cbs[idx] and then
separately assigns it to the snp_seq local variable. This commit does
the assignment earlier to simplify the code a bit. While in the area,
this commit also takes advantage of the 100-character line limit to put
the call to srcu_schedule_cbs_sdp() on a single line.

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 3bedebcf 24-Jan-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Make rcutorture dump the SRCU size state

This commit adds the numeric and string version of ->srcu_size_state to
the Tree-SRCU-specific portion of the rcutorture output.

[ paulmck: Apply feedback from kernel test robot and Dan Carpenter. ]
[ quic_neeraju: Apply feedback from Jiapeng Chong. ]

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# e2f63836 24-Jan-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Add size-state transitioning code

This is just dead code at the moment, and will be used once
the state-transition code is activated.

Because srcu_barrier() must be aware of transition before call_srcu(), the
state machine waits for an SRCU grace period before callbacks are queued
to the non-CPU-0 queues. This requres that portions of srcu_barrier()
be enclosed in an SRCU read-side critical section.

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 2ec30311 21-Jan-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Dynamically allocate srcu_node array

This commit shrinks the srcu_struct structure by converting its ->node
field from a fixed-size compile-time array to a pointer to a dynamically
allocated array. In kernels built with large values of NR_CPUS that boot
on systems with smaller numbers of CPUs, this can save significant memory.

[ paulmck: Apply kernel test robot feedback. ]

Reported-by: A cast of thousands
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 994f7068 24-Jan-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Make Tree SRCU able to operate without snp_node array

This commit makes Tree SRCU able to operate without an snp_node
array, that is, when the srcu_data structures' ->mynode pointers
are NULL. This can result in high contention on the srcu_struct
structure's ->lock, but only when there are lots of call_srcu(),
synchronize_srcu(), and synchronize_srcu_expedited() calls.

Note that when there is no snp_node array, all SRCU callbacks use
CPU 0's callback queue. This is optimal in the common case of low
update-side load because it removes the need to search each CPU
for the single callback that made the grace period happen.

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 7b9e9b58 20-Jan-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Make srcu_funnel_gp_start() cache ->mynode in snp_leaf

Currently, the srcu_funnel_gp_start() walks its local variable snp up the
tree and reloads sdp->mynode whenever it is necessary to check whether
it is still at the leaf srcu_node level. This works, but is a bit more
obtuse than absolutely necessary. In addition, upcoming commits will
dynamically size srcu_struct structures, in which case sdp->mynode will
no longer necessarily be a constant, and this commit helps prepare for
that dynamic sizing.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 8ed00760 12-Jan-2022 Paul E. McKenney <paulmck@kernel.org>

srcu: Tighten cleanup_srcu_struct() GP checks

Currently, cleanup_srcu_struct() checks for a grace period in progress,
but it does not check for a grace period that has not yet started but
which might start at any time. Such a situation could result in a
use-after-free bug, so this commit adds a check for a grace period that
is needed but not yet started to cleanup_srcu_struct().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a616aec9 22-Mar-2021 Ingo Molnar <mingo@kernel.org>

rcu: Fix various typos in comments

Fix ~12 single-word typos in RCU code comments.

[ paulmck: Apply feedback from Randy Dunlap. ]
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 3d3a0d1b 16-Apr-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Point to documentation of ordering guarantees

Add comments to synchronize_rcu() and friends that point to
Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# b5befe84 17-Apr-2021 Frederic Weisbecker <frederic@kernel.org>

srcu: Fix broken node geometry after early ssp init

An srcu_struct structure that is initialized before rcu_init_geometry()
will have its srcu_node hierarchy based on CONFIG_NR_CPUS. Once
rcu_init_geometry() is called, this hierarchy is compressed as needed
for the actual maximum number of CPUs for this system.

Later on, that srcu_struct structure is confused, sometimes referring
to its initial CONFIG_NR_CPUS-based hierarchy, and sometimes instead
to the new num_possible_cpus() hierarchy. For example, each of its
->mynode fields continues to reference the original leaf rcu_node
structures, some of which might no longer exist. On the other hand,
srcu_for_each_node_breadth_first() traverses to the new node hierarchy.

There are at least two bad possible outcomes to this:

1) a) A callback enqueued early on an srcu_data structure (call it
*sdp) is recorded pending on sdp->mynode->srcu_data_have_cbs in
srcu_funnel_gp_start() with sdp->mynode pointing to a deep leaf
(say 3 levels).

b) The grace period ends after rcu_init_geometry() shrinks the
nodes level to a single one. srcu_gp_end() walks through the new
srcu_node hierarchy without ever reaching the old leaves so the
callback is never executed.

This is easily reproduced on an 8 CPUs machine with CONFIG_NR_CPUS >= 32
and "rcupdate.rcu_self_test=1". The srcu_barrier() after early tests
verification never completes and the boot hangs:

[ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 seconds.
[ 5413.147564] Not tainted 5.12.0-rc4+ #28
[ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5413.159753] task:swapper/0 state:D stack: 0 pid: 1 ppid: 0 flags:0x00004000
[ 5413.168099] Call Trace:
[ 5413.170555] __schedule+0x36c/0x930
[ 5413.174057] ? wait_for_completion+0x88/0x110
[ 5413.178423] schedule+0x46/0xf0
[ 5413.181575] schedule_timeout+0x284/0x380
[ 5413.185591] ? wait_for_completion+0x88/0x110
[ 5413.189957] ? mark_held_locks+0x61/0x80
[ 5413.193882] ? mark_held_locks+0x61/0x80
[ 5413.197809] ? _raw_spin_unlock_irq+0x24/0x50
[ 5413.202173] ? wait_for_completion+0x88/0x110
[ 5413.206535] wait_for_completion+0xb4/0x110
[ 5413.210724] ? srcu_torture_stats_print+0x110/0x110
[ 5413.215610] srcu_barrier+0x187/0x200
[ 5413.219277] ? rcu_tasks_verify_self_tests+0x50/0x50
[ 5413.224244] ? rdinit_setup+0x2b/0x2b
[ 5413.227907] rcu_verify_early_boot_tests+0x2d/0x40
[ 5413.232700] do_one_initcall+0x63/0x310
[ 5413.236541] ? rdinit_setup+0x2b/0x2b
[ 5413.240207] ? rcu_read_lock_sched_held+0x52/0x80
[ 5413.244912] kernel_init_freeable+0x253/0x28f
[ 5413.249273] ? rest_init+0x250/0x250
[ 5413.252846] kernel_init+0xa/0x110
[ 5413.256257] ret_from_fork+0x22/0x30

2) An srcu_struct structure that is initialized before rcu_init_geometry()
and used afterward will always have stale rdp->mynode references,
resulting in callbacks to be missed in srcu_gp_end(), just like in
the previous scenario.

This commit therefore causes init_srcu_struct_nodes to initialize the
geometry, if needed. This ensures that the srcu_node hierarchy is
properly built and distributed from the get-go.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 8e9c01c7 08-Apr-2021 Frederic Weisbecker <frederic@kernel.org>

srcu: Initialize SRCU after timers

Once srcu_init() is called, the SRCU core will make use of delayed
workqueues, which rely on timers. However init_timers() is called
several steps after rcu_init(). This means that a call_srcu() after
rcu_init() but before init_timers() would find itself within a dangerously
uninitialized timer core.

This commit therefore creates a separate call to srcu_init() after
init_timer() completes, which ensures that we stay in early SRCU mode
until timers are safe(r).

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# c75e9d29 01-Apr-2021 Frederic Weisbecker <frederic@kernel.org>

srcu: Remove superfluous ssp initialization for early callbacks

Pre-srcu_init() invocations of call_srcu() initialize the srcu_struct
structure in question, so there is no need to check this initialization
in srcu_init() when initiating grace periods for srcu_struct structures
that had early call_srcu() invocations. This commit therefore drops
the calls to check_init_srcu_struct() in srcu_init().

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 94df76a1 01-Apr-2021 Frederic Weisbecker <frederic@kernel.org>

srcu: Remove superfluous sdp->srcu_lock_count zero filling

Because alloc_percpu() zeroes out the allocated memory, there is no need
to zero-fill newly allocated per-CPU memory. This commit therefore removes
the loop zeroing the ->srcu_lock_count and ->srcu_unlock_count arrays
from init_srcu_struct_nodes(). This is the only use of that function's
is_static parameter, which this commit also removes.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# ae5c2341 23-Sep-2020 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/segcblist: Add counters to segcblist datastructure

Add counting of segment lengths of segmented callback list.

This will be useful for a number of things such as knowing how big the
ready-to-execute segment have gotten. The immediate benefit is ability
to trace how the callbacks in the segmented callback list change.

Also this patch remove hacks related to using donecbs's ->len field as a
temporary variable to save the segmented callback list's length. This cannot be
done anymore and is not needed.

Also fix SRCU:
The negative counting of the unsegmented list cannot be used to adjust
the segmented one. To fix this, sample the unsegmented length in
advance, and use it after CB execution to adjust the segmented list's
length.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 4e7ccfae 15-Nov-2020 Paul E. McKenney <paulmck@kernel.org>

srcu: Add comment explaining cookie overflow/wrap

This commit adds to the poll_state_synchronize_srcu() header comment
describing the issues surrounding SRCU cookie overflow/wrap for the
different kernel configurations.

Link: https://lore.kernel.org/rcu/20201112201547.GF3365678@moria.home.lan/
Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 5358c9fa 13-Nov-2020 Paul E. McKenney <paulmck@kernel.org>

srcu: Provide polling interfaces for Tree SRCU grace periods

There is a need for a polling interface for SRCU grace
periods, so this commit supplies get_state_synchronize_srcu(),
start_poll_synchronize_srcu(), and poll_state_synchronize_srcu() for this
purpose. The first can be used if future grace periods are inevitable
(perhaps due to a later call_srcu() invocation), the second if future
grace periods might not otherwise happen, and the third to check if a
grace period has elapsed since the corresponding call to either of the
first two.

As with get_state_synchronize_rcu() and cond_synchronize_rcu(),
the return value from either get_state_synchronize_srcu() or
start_poll_synchronize_srcu() must be passed in to a later call to
poll_state_synchronize_srcu().

Link: https://lore.kernel.org/rcu/20201112201547.GF3365678@moria.home.lan/
Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
[ paulmck: Add EXPORT_SYMBOL_GPL() per kernel test robot feedback. ]
[ paulmck: Apply feedback from Neeraj Upadhyay. ]
Link: https://lore.kernel.org/lkml/20201117004017.GA7444@paulmck-ThinkPad-P72/
Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 29d2bb94 13-Nov-2020 Paul E. McKenney <paulmck@kernel.org>

srcu: Provide internal interface to start a Tree SRCU grace period

There is a need for a polling interface for SRCU grace periods.
This polling needs to initiate an SRCU grace period without having
to queue (and manage) a callback. This commit therefore splits the
Tree SRCU __call_srcu() function into callback-initialization and
queuing/start-grace-period portions, with the latter in a new function
named srcu_gp_start_if_needed(). This function may be passed a NULL
callback pointer, in which case it will refrain from queuing anything.

Why have the new function mess with queuing? Locking considerations,
of course!

Link: https://lore.kernel.org/rcu/20201112201547.GF3365678@moria.home.lan/
Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 50edb988 10-Sep-2020 Paul E. McKenney <paulmck@kernel.org>

srcu: Take early exit on memory-allocation failure

It turns out that init_srcu_struct() can be invoked from usermode tasks,
and that fatal signals received by these tasks can cause memory-allocation
failures. These failures are not handled well by init_srcu_struct(),
so much so that NULL pointer dereferences can result. This commit
therefore causes init_srcu_struct() to take an early exit upon detection
of memory-allocation failure.

Link: https://lore.kernel.org/lkml/20200908144306.33355-1-aik@ozlabs.ru/
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# f505d434 16-Sep-2020 Jakub Kicinski <kuba@kernel.org>

srcu: Use a more appropriate lockdep helper

The lockdep_is_held() macro is defined as:

#define lockdep_is_held(lock) lock_is_held(&(lock)->dep_map)

This hides away the dereference, so that builds with !LOCKDEP don't break.
This works in current kernels because the RCU_LOCKDEP_WARN() eliminates
its condition at preprocessor time in !LOCKDEP kernels. However, later
patches in this series will cause the compiler to see this condition even
in !LOCKDEP kernels. This commit prepares for this upcoming change by
switching from lock_is_held() to lockdep_is_held().

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
--
CC: jiangshanlai@gmail.com
CC: paulmck@kernel.org
CC: josh@joshtriplett.org
CC: rostedt@goodmis.org
CC: mathieu.desnoyers@efficios.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# d9b60741 17-Jun-2020 Paul E. McKenney <paulmck@kernel.org>

srcu: Remove KCSAN stubs

KCSAN is now in mainline, so this commit removes the stubs for the
data_race(), ASSERT_EXCLUSIVE_WRITER(), and ASSERT_EXCLUSIVE_ACCESS()
macros.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# bde50d8f 26-May-2020 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

srcu: Avoid local_irq_save() before acquiring spinlock_t

SRCU disables interrupts to get a stable per-CPU pointer and then
acquires the spinlock which is in the per-CPU data structure. The
release uses spin_unlock_irqrestore(). While this is correct on a non-RT
kernel, this conflicts with the RT semantics because the spinlock is
converted to a 'sleeping' spinlock. Sleeping locks can obviously not be
acquired with interrupts disabled.

Acquire the per-CPU pointer `ssp->sda' without disabling preemption and
then acquire the spinlock_t of the per-CPU data structure. The lock will
ensure that the data is consistent.

The added call to check_init_srcu_struct() is now needed because a
statically defined srcu_struct may remain uninitialized until this
point and the newly introduced locking operation requires an initialized
spinlock_t.

This change was tested for four hours with 8*SRCU-N and 8*SRCU-P without
causing any warnings.

Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: rcu@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 7fef6cff 18-Apr-2020 Ethon Paul <ethp@qq.com>

srcu: Fix a typo in comment "amoritized"->"amortized"

This commit fixes a typo in a comment.

Signed-off-by: Ethon Paul <ethp@qq.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# b68c6146 03-Jan-2020 Paul E. McKenney <paulmck@kernel.org>

srcu: Add data_race() to ->srcu_lock_count and ->srcu_unlock_count arrays

The srcu_data structure's ->srcu_lock_count and ->srcu_unlock_count arrays
are read and written locklessly, so this commit adds the data_race()
to the diagnostic-print loads from these arrays in order mark them as
known and approved data-racy accesses.

This data race was reported by KCSAN. Not appropriate for backporting due
to failure being unlikely and due to this being used only by rcutorture.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 4f58820f 13-Apr-2020 Paul E. McKenney <paulmck@kernel.org>

srcu: Add KCSAN stubs

This commit adds stubs for KCSAN's data_race(), ASSERT_EXCLUSIVE_WRITER(),
and ASSERT_EXCLUSIVE_ACCESS() macros to allow code using these macros to
move ahead.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 71042606 03-Jan-2020 Paul E. McKenney <paulmck@kernel.org>

srcu: Hold srcu_struct ->lock when updating ->srcu_gp_seq

A read of the srcu_struct structure's ->srcu_gp_seq field should not
need READ_ONCE() when that structure's ->lock is held. Except that this
lock is not always held when updating this field. This commit therefore
acquires the lock around updates and removes a now-unneeded READ_ONCE().

This data race was reported by KCSAN.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Switch from READ_ONCE() to lock per Peter Zilstra question. ]
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>


# 39f91504 22-Dec-2019 Paul E. McKenney <paulmck@kernel.org>

srcu: Fix process_srcu()/srcu_batches_completed() datarace

The srcu_struct structure's ->srcu_idx field is accessed locklessly,
so reads must use READ_ONCE(). This commit therefore adds the needed
READ_ONCE() invocation where it was missed.

This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 8c9e0cb3 22-Dec-2019 Paul E. McKenney <paulmck@kernel.org>

srcu: Fix __call_srcu()/srcu_get_delay() datarace

The srcu_struct structure's ->srcu_gp_seq_needed_exp field is accessed
locklessly, so updates must use WRITE_ONCE(). This commit therefore
adds the needed WRITE_ONCE() invocations.

This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 7ff8b450 22-Dec-2019 Paul E. McKenney <paulmck@kernel.org>

srcu: Fix __call_srcu()/process_srcu() datarace

The srcu_node structure's ->srcu_gp_seq_needed_exp field is accessed
locklessly, so updates must use WRITE_ONCE(). This commit therefore
adds the needed WRITE_ONCE() invocations.

This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 65bb0dc4 06-Jan-2020 SeongJae Park <sjpark@amazon.de>

rcu: Fix typos in file-header comments

Convert to plural and add a note that this is for Tree RCU.

Signed-off-by: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 844a378d 04-Nov-2019 Paul E. McKenney <paulmck@kernel.org>

srcu: Apply *_ONCE() to ->srcu_last_gp_end

The ->srcu_last_gp_end field is accessed from any CPU at any time
by synchronize_srcu(), so non-initialization references need to use
READ_ONCE() and WRITE_ONCE(). This commit therefore makes that change.

Reported-by: syzbot+08f3e9d26e5541e1ecf2@syzkaller.appspotmail.com
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 77a40f97 29-Aug-2019 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Remove kfree_rcu() special casing and lazy-callback handling

This commit removes kfree_rcu() special-casing and the lazy-callback
handling from Tree RCU. It moves some of this special casing to Tiny RCU,
the removal of which will be the subject of later commits.

This results in a nice negative delta.

Suggested-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Add slab.h #include, thanks to kbuild test robot <lkp@intel.com>. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 7e210a65 28-Jun-2019 Paul E. McKenney <paulmck@kernel.org>

srcu: Avoid srcutorture security-based pointer obfuscation

Because pointer output is now obfuscated, and because what you really
want to know is whether or not the callback lists are empty, this commit
replaces the srcu_data structure's head callback pointer printout with
a single character that is "." is the callback list is empty or "C"
otherwise.

This is the only remaining user of rcu_segcblist_head(), so this
commit also removes this function's definition. It also turns out that
rcu_segcblist_tail() no longer has any callers, so this commit removes
that function's definition while in the area. They were both marked
"Interim", and their end has come.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 11b00045 22-Apr-2019 Jiang Biao <benbjiang@tencent.com>

rcu: Make __call_srcu static

Because __call_srcu() is not used outside kernel/rcu/srcutree.c,
this commit makes it static.

Signed-off-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# fe15b50c 05-Apr-2019 Paul E. McKenney <paulmck@kernel.org>

srcu: Allocate per-CPU data for DEFINE_SRCU() in modules

Adding DEFINE_SRCU() or DEFINE_STATIC_SRCU() to a loadable module requires
that the size of the reserved region be increased, which is not something
we want to be doing all that often. One approach would be to require
that loadable modules define an srcu_struct and invoke init_srcu_struct()
from their module_init function and cleanup_srcu_struct() from their
module_exit function. However, this is more than a bit user unfriendly.

This commit therefore creates an ___srcu_struct_ptrs linker section,
and pointers to srcu_struct structures created by DEFINE_SRCU() and
DEFINE_STATIC_SRCU() within a module are placed into that module's
___srcu_struct_ptrs section. The required init_srcu_struct() and
cleanup_srcu_struct() functions are then automatically invoked as needed
when that module is loaded and unloaded, thus allowing modules to continue
to use DEFINE_SRCU() and DEFINE_STATIC_SRCU() while avoiding the need
to increase the size of the reserved region.

Many of the algorithms and some of the code was cheerfully cherry-picked
from other code making use of linker sections, perhaps most notably from
tracepoints. All bugs are nevertheless the sole property of the author.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
[ paulmck: Use __section() and use "default" in srcu_module_notify()'s
"switch" statement as suggested by Joel Fernandes. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>


# f5ad3991 13-Feb-2019 Paul E. McKenney <paulmck@kernel.org>

srcu: Remove cleanup_srcu_struct_quiesced()

The cleanup_srcu_struct_quiesced() function was added because NVME
used WQ_MEM_RECLAIM workqueues and SRCU did not, which meant that
NVME workqueues waiting on SRCU workqueues could result in deadlocks
during low-memory conditions. However, SRCU now also has WQ_MEM_RECLAIM
workqueues, so there is no longer a potential for deadlock. Furthermore,
it turns out to be extremely hard to use cleanup_srcu_struct_quiesced()
correctly due to the fact that SRCU callback invocation accesses the
srcu_struct structure's per-CPU data area just after callbacks are
invoked. Therefore, the usual practice of using srcu_barrier() to wait
for callbacks to be invoked before invoking cleanup_srcu_struct_quiesced()
fails because SRCU's callback-invocation workqueue handler might be
delayed, which can result in cleanup_srcu_struct_quiesced() being invoked
(and thus freeing the per-CPU data) before the SRCU's callback-invocation
workqueue handler is finished using that per-CPU data. Nor is this a
theoretical problem: KASAN emitted use-after-free warnings because of
this problem on actual runs.

In short, NVME can now safely invoke cleanup_srcu_struct(), which
avoids the use-after-free scenario. And cleanup_srcu_struct_quiesced()
is quite difficult to use safely. This commit therefore removes
cleanup_srcu_struct_quiesced(), switching its sole user back to
cleanup_srcu_struct(). This effectively reverts the following pair
of commits:

f7194ac32ca2 ("srcu: Add cleanup_srcu_struct_quiesced()")
4317228ad9b8 ("nvme: Avoid flush dependency in delete controller flow")

Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Bart Van Assche <bvanassche@acm.org>


# 5cdfd174 12-Feb-2019 Paul E. McKenney <paulmck@kernel.org>

srcu: Check for in-flight callbacks in _cleanup_srcu_struct()

If someone fails to drain the corresponding SRCU callbacks (for
example, by failing to invoke srcu_barrier()) before invoking either
cleanup_srcu_struct() or cleanup_srcu_struct_quiesced(), the resulting
diagnostic is an ambiguous use-after-free diagnostic, and even then
only if you are running something like KASAN. This commit therefore
improves SRCU diagnostics by adding checks for in-flight callbacks at
_cleanup_srcu_struct() time.

Note that these diagnostics can still be defeated, for example, by
invoking call_srcu() concurrently with cleanup_srcu_struct(). Which is
a really bad idea, but sometimes all too easy to do. But even then,
these diagnostics have at least some probability of catching the problem.

Reported-by: Sagi Grimberg <sagi@grimberg.me>
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-by: Bart Van Assche <bvanassche@acm.org>


# e7ee1501 17-Jan-2019 Paul E. McKenney <paulmck@kernel.org>

rcu/srcu: Convert to SPDX license identifier

Replace the license boiler plate with a SPDX license identifier.
While in the area, update an email address.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>


# e81baf4c 10-Dec-2018 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

srcu: Remove srcu_queue_delayed_work_on()

srcu_queue_delayed_work_on() disables preemption (and therefore CPU
hotplug in RCU's case) and then checks based on its own accounting if a
CPU is online. If the CPU is online it uses queue_delayed_work_on()
otherwise it fallbacks to queue_delayed_work().
The problem here is that queue_work() on -RT does not work with disabled
preemption.

queue_work_on() works also on an offlined CPU. queue_delayed_work_on()
has the problem that it is possible to program a timer on an offlined
CPU. This timer will fire once the CPU is online again. But until then,
the timer remains programmed and nothing will happen.

Add a local timer which will fire (as requested per delay) on the local
CPU and then enqueue the work on the specific CPU.

RCUtorture testing with SRCU-P for 24h showed no problems.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# aacb5d91 28-Oct-2018 Paul E. McKenney <paulmck@kernel.org>

srcu: Use "ssp" instead of "sp" for srcu_struct pointer

In RCU, the distinction between "rsp", "rnp", and "rdp" has served well
for a great many years, but in SRCU, "sp" vs. "sdp" has proven confusing.
This commit therefore renames SRCU's "sp" pointers to "ssp", so that there
is "ssp" for srcu_struct pointer, "snp" for srcu_node pointer, and "sdp"
for srcu_data pointer.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# eb4c2382 26-Oct-2018 Dennis Krein <Dennis.Krein@netapp.com>

srcu: Lock srcu_data structure in srcu_gp_start()

The srcu_gp_start() function is called with the srcu_struct structure's
->lock held, but not with the srcu_data structure's ->lock. This is
problematic because this function accesses and updates the srcu_data
structure's ->srcu_cblist, which is protected by that lock. Failing to
hold this lock can result in corruption of the SRCU callback lists,
which in turn can result in arbitrarily bad results.

This commit therefore makes srcu_gp_start() acquire the srcu_data
structure's ->lock across the calls to rcu_segcblist_advance() and
rcu_segcblist_accelerate(), thus preventing this corruption.

Reported-by: Bart Van Assche <bvanassche@acm.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Reported-by: Sebastian Kuzminsky <seb.kuzminsky@gmail.com>
Signed-off-by: Dennis Krein <Dennis.Krein@netapp.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-by: Dennis Krein <Dennis.Krein@netapp.com>
Cc: <stable@vger.kernel.org> # 4.16.x


# 0607ba84 25-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

srcu: Prevent __call_srcu() counter wrap with read-side critical section

Ever since cdf7abc4610a ("srcu: Allow use of Tiny/Tree SRCU from
both process and interrupt context"), it has been permissible
to use SRCU read-side critical sections in interrupt context.
This allows __call_srcu() to use SRCU read-side critical sections to
prevent a new SRCU grace period from ending before the call to either
srcu_funnel_gp_start() or srcu_funnel_exp_start completes, thus preventing
SRCU grace-period counter overflow during that time.

Note that this does not permit removal of the counter-wrap checks in
srcu_gp_end(). These check are necessary to handle the case where
a given CPU does not interact at all with SRCU for an extended time
period.

This commit therefore adds an SRCU read-side critical section to
__call_srcu() in order to prevent grace period counter wrap during
the funnel-locking process.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 4e6ea4ef 14-Aug-2018 Paul E. McKenney <paulmck@kernel.org>

srcu: Make early-boot call_srcu() reuse workqueue lists

Allocating a list_head structure that is almost never used, and, when
used, is used only during early boot (rcu_init() and earlier), is a bit
wasteful. This commit therefore eliminates that list_head in favor of
the one in the work_struct structure. This is safe because the work_struct
structure cannot be used until after rcu_init() returns.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# e0fcba9a 14-Aug-2018 Paul E. McKenney <paulmck@kernel.org>

srcu: Make call_srcu() available during very early boot

Event tracing is moving to SRCU in order to take advantage of the fact
that SRCU may be safely used from idle and even offline CPUs. However,
event tracing can invoke call_srcu() very early in the boot process,
even before workqueue_init_early() is invoked (let alone rcu_init()).
Therefore, call_srcu()'s attempts to queue work fail miserably.

This commit therefore detects this situation, and refrains from attempting
to queue work before rcu_init() time, but does everything else that it
would have done, and in addition, adds the srcu_struct to a global list.
The rcu_init() function now invokes a new srcu_init() function, which
is empty if CONFIG_SRCU=n. Otherwise, srcu_init() queues work for
each srcu_struct on the list. This all happens early enough in boot
that there is but a single CPU with interrupts disabled, which allows
synchronization to be dispensed with.

Of course, the queued work won't actually be invoked until after
workqueue_init() is invoked, which happens shortly after the scheduler
is up and running. This means that although call_srcu() may be invoked
any time after per-CPU variables have been set up, there is still a very
narrow window when synchronize_srcu() won't work, and this window
extends from the time that the scheduler starts until the time that
workqueue_init() returns. This can be fixed in a manner similar to
the fix for synchronize_rcu_expedited() and friends, but until someone
actually needs to use synchronize_srcu() during this window, this fix
is added churn for no benefit.

Finally, note that Tree SRCU's new srcu_init() function invokes
queue_work() rather than the queue_delayed_work() function that is
invoked post-boot. The reason is that queue_delayed_work() will (as you
would expect) post a timer, and timers have not yet been initialized.
So use of queue_work() avoids the complaints about use of uninitialized
spinlocks that would otherwise result. Besides, some delay is already
provide by the aforementioned fact that the queued work won't actually
be invoked until after the scheduler is up and running.

Requested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 6eb95cc4 07-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Clean up flavor-related definitions and comments in srcutree.h

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# aedf4ba9 04-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_node tree accessor macros

There now is only one rcu_state structure in a given build of the Linux
kernel, so there is no need to pass it as a parameter to RCU's rcu_node
tree's accessor macros. This commit therefore removes the rsp parameter
from those macros in kernel/rcu/rcu.h, and removes some now-unused rsp
local variables while in the area.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 52e17ba1 19-Jun-2018 Paul E. McKenney <paulmck@kernel.org>

srcu: Add grace-period number to rcutorture statistics printout

This commit adds the SRCU grace-period number to the rcutorture statistics
printout, which allows it to be compared to the rcutorture "Writer stall
state" message.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a7538352 14-May-2018 Joe Perches <joe@perches.com>

rcu: Use pr_fmt to prefix "rcu: " to logging output

This commit also adjusts some whitespace while in the area.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Revert string-breaking %s as requested by Andy Shevchenko. ]


# aebc8264 01-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Convert rcutorture_get_gp_data() to ->gp_seq

SRCU has long used ->srcu_gp_seq, and now RCU uses ->gp_seq. This
commit therefore moves the rcutorture_get_gp_data() function from
a ->gpnum / ->completed pair to ->gp_seq.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 5ab07a8d 22-May-2018 Paul E. McKenney <paulmck@kernel.org>

srcu: Add address of first callback to rcutorture output

This commit adds the address of the first callback to the per-CPU rcutorture
output in order to allow lost wakeups to be more efficiently tracked down.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 17294ce6 25-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

srcu: Document that srcu_funnel_gp_start() implies srcu_funnel_exp_start()

This commit updates the header comment of srcu_funnel_gp_start() to
document the fact that srcu_funnel_gp_start() does the work of
srcu_funnel_exp_start(), in some cases by invoking it directly.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 5ef98a63 24-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

srcu: Fix typos in __call_srcu() header comment

This commit simply changes some copy-pasta call_rcu() instances to
the correct call_srcu().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# f7194ac3 05-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

srcu: Add cleanup_srcu_struct_quiesced()

The current cleanup_srcu_struct() flushes work, which prevents it
from being invoked from some workqueue contexts, as well as from
atomic (non-blocking) contexts. This patch therefore introduced a
cleanup_srcu_struct_quiesced(), which can be invoked only after all
activity on the specified srcu_struct has completed. This restriction
allows cleanup_srcu_struct_quiesced() to be invoked from workqueue
contexts as well as from atomic contexts.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nitzan Carmi <nitzanc@mellanox.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# ad7c946b 08-Jan-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Create RCU-specific workqueues with rescuers

RCU's expedited grace periods can participate in out-of-memory deadlocks
due to all available system_wq kthreads being blocked and there not being
memory available to create more. This commit prevents such deadlocks
by allocating an RCU-specific workqueue_struct at early boot time, and
providing it with a rescuer to ensure forward progress. This uses the
shiny new init_rescuer() function provided by Tejun (but indirectly).

This commit also causes SRCU to use this new RCU-specific
workqueue_struct. Note that SRCU's use of workqueues never blocks them
waiting for readers, so this should be safe from a forward-progress
viewpoint. Note that this moves SRCU from system_power_efficient_wq
to a normal workqueue. In the unlikely event that this results in
measurable degradation, a separate power-efficient workqueue will be
creates for SRCU.

Reported-by: Prateek Sood <prsood@codeaurora.org>
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>


# 6308f347 14-Feb-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove SRCU throttling

The code in srcu_gp_end() inserts a delay every 0x3ff grace periods in
order to prevent SRCU grace-period work from consuming an entire CPU
when there is a long sequence of expedited SRCU grace-period requests.
However, all of SRCU's grace-period work is carried out in workqueues,
which are in turn within kthreads, which are automatically throttled as
needed by the scheduler. In particular, if there is plenty of idle time,
there is no point in throttling.

This commit therefore removes the expedited SRCU grace-period throttling.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a72da917 14-Feb-2018 Byungchul Park <byungchul.park@lge.com>

srcu: Remove dead code in srcu_gp_end()

Of course, compilers will optimize out a dead code. Anyway, remove
any dead code for better readibility.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 8ddbd883 31-Jan-2018 Ildar Ismagilov <devix84@gmail.com>

srcu: Reduce scans of srcu_data in counter wrap check

Currently, given a multi-level srcu_node tree, SRCU can scan the full
set of srcu_data structures at each level when cleaning up after a grace
period. This, though harmless otherwise, represents pointless overhead.
This commit therefore eliminates this overhead by scanning the srcu_data
structures only when traversing the leaf srcu_node structures.

Signed-off-by: Ildar Ismagilov <devix84@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a35d13ec 31-Jan-2018 Ildar Ismagilov <devix84@gmail.com>

srcu: Prevent sdp->srcu_gp_seq_needed_exp counter wrap

SRCU checks each srcu_data structure's grace-period number for counter
wrap four times per cycle by default. This frequency guarantees that
normal comparisons will detect potential wrap. However, the expedited
grace-period number is not checked. The consquences are not too horrible
(a failure to expedite a grace period when requested), but it would be
good to avoid such things. This commit therefore adds this check to
the expedited grace-period number.

Signed-off-by: Ildar Ismagilov <devix84@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# cb4081cd 01-Dec-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Abstract function name

This commit moves to __func__ for function names in the name of better
resilience to change.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 274afd6b 30-Jan-2018 Ildar Ismagilov <devix84@gmail.com>

rcu: Fix misprint in srcu_funnel_exp_start

The srcu_funnel_exp_start() function checks to see if the srcu_struct
structure's expedited grace period counter needs updating to reflect a
newly arrived request for an expedited SRCU grace period. Unfortunately,
the check is backwards, so this commit reverses the sense of the test.

Signed-off-by: Ildar Ismagilov <devix84@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a32e01ee 17-Jan-2018 Matthew Wilcox <willy@infradead.org>

rcu: Use wrapper for lockdep asserts

Commits c0b334c5bfa9 and ea9b0c8a26a2 introduced new sparse warnings
by accessing rcu_node->lock directly and ignoring the __private
marker. Introduce a new wrapper and use it. Also fix a similar problem
in srcutree.c introduced by a3883df3935e.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d6331980 10-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Prohibit call_srcu() use under raw spinlocks

Invoking queue_delayed_work() while holding a raw spinlock is forbidden
in -rt kernels, which is exactly what __call_srcu() does, indirectly via
srcu_funnel_gp_start(). This commit therefore downgrades Tree SRCU's
locking from raw to non-raw spinlocks, which works because call_srcu()
is not ever called while holding a raw spinlock.

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# e4d0b679 17-Sep-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Add parameters to SRCU docbook comments

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 27fdb35f 19-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

doc: Fix various RCU docbook comment-header problems

Because many of RCU's files have not been included into docbook, a
number of errors have accumulated. This commit fixes them.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 35732cf9 05-Jul-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Provide ordering for CPU not involved in grace period

Tree RCU guarantees that every online CPU has a memory barrier between
any given grace period and any of that CPU's RCU read-side sections that
must be ordered against that grace period. Since RCU doesn't always
know where read-side critical sections are, the actual implementation
guarantees order against prior and subsequent non-idle non-offline code,
whether in an RCU read-side critical section or not. As a result, there
does not need to be a memory barrier at the end of synchronize_rcu()
and friends because the ordering internal to the grace period has
ordered every CPU's post-grace-period execution against each CPU's
pre-grace-period execution, again for all non-idle online CPUs.

In contrast, SRCU can have non-idle online CPUs that are completely
uninvolved in a given SRCU grace period, for example, a CPU that
never runs any SRCU read-side critical sections and took no part in
the grace-period processing. It is in theory possible for a given
synchronize_srcu()'s wakeup to be delivered to a CPU that was completely
uninvolved in the prior SRCU grace period, which could mean that the
code following that synchronize_srcu() would end up being unordered with
respect to both the grace period and any pre-existing SRCU read-side
critical sections.

This commit therefore adds an smp_mb() to the end of __synchronize_srcu(),
which prevents this scenario from occurring.

Reported-by: Lance Roy <ldr709@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Lance Roy <ldr709@gmail.com>
Cc: <stable@vger.kernel.org> # 4.12.x


# ac3748c6 22-May-2017 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Print SRCU lock/unlock totals

This commit adds printing of SRCU lock/unlock totals, which are just
the sums of the per-CPU counts. Saves a bit of mental arithmetic.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 115a1a52 22-May-2017 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Move SRCU status printing to SRCU implementations

This commit gets rid of some ugly #ifdefs in rcutorture.c by moving
the SRCU status printing to the SRCU implementations.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 0d8a1e83 15-Jun-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Make process_srcu() be static

The function process_srcu() is not invoked outside of srcutree.c, so
this commit makes it static and drops the EXPORT_SYMBOL_GPL().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a3883df3 09-May-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Use rnp->lock wrappers to replace explicit memory barriers

This commit uses TREE RCU's rnp->lock wrappers to replace a few explicit
memory barriers. This change also has the advantage of making SRCU's
memory-ordering properties be implemented in roughly the same way as they
are in Tree RCU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 5a0465e1 04-May-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Shrink srcu.h by moving docbook and private function

The call_srcu() docbook entry is currently in include/linux/srcu.h,
which causes needless processing for each include point. This commit
therefore moves this entry to kernel/rcu/srcutree.c, which the compiler
reads only once. In addition, the srcu_batches_completed() function is
used only within RCU and its torture-test suites. This commit therefore
also moves this function's declaration from include/linux/srcutiny.h,
include/linux/srcutree.h, and include/linux/srcuclassic.h to
kernel/rcu/rcu.h.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# c350c008 03-May-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Prevent sdp->srcu_gp_seq_needed counter wrap

If a given CPU never happens to ever start an SRCU grace period, the
grace-period sequence counter might wrap. If this CPU were to decide to
finally start a grace period, the state of its sdp->srcu_gp_seq_needed
might make it appear that it has already requested this grace period,
which would prevent starting the grace period. If no other CPU ever started
a grace period again, this would look like a grace-period hang. Even
if some other CPU took pity and started the needed grace period, the
leaf rcu_node structure's ->srcu_data_have_cbs field won't have record
of the fact that this CPU has a callback pending, which would look like
a very localized grace-period hang.

This might seem very unlikely, but SRCU grace periods can take less than
a microsecond on small systems, which means that overflow can happen
in much less than an hour on a 32-bit embedded system. And embedded
systems are especially likely to have long-term idle CPUs. Therefore,
it makes sense to prevent this scenario from happening.

This commit therefore scans each srcu_data structure occasionally,
with frequency controlled by the srcutree.counter_wrap_check kernel
boot parameter. This parameter can be set to something like 255
in order to exercise the counter-wrap-prevention code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a602538e 28-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Add DEBUG_OBJECTS_RCU_HEAD functionality

This commit adds DEBUG_OBJECTS_RCU_HEAD checking to detect call_srcu()
counterparts to double-free bugs.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 0c8e0e3c 28-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Print non-default exp_holdoff values at boot time

This commit makes srcu_bootup_announce() check for non-default values
of the auto-expedite holdoff time exp_holdoff and print a message if so.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b5815e6c 28-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Make exp_holdoff module parameter be static

Because exp_holdoff is not used outside of srcutree.c, it can be static.
This commit therefore makes this change.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 1f4f6da1 21-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Make Classic and Tree SRCU announce themselves at bootup

Currently, the only way to tell whether a given kernel is running
Classic, Tiny, or Tree SRCU is to look at the .config file, which
can easily be lost or associated with the wrong kernel. This commit
therefore has Classic and Tree SRCU identify themselves at boot time.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 881ec9d2 12-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Eliminate possibility of destructive counter overflow

Earlier versions of Tree SRCU were subject to a counter overflow bug that
could theoretically result in too-short grace periods. This commit
eliminates this problem by adding an update-side memory barrier.
The short explanation is that if the updater sums the unlock counts
too late to see a given __srcu_read_unlock() increment, that CPU's
next __srcu_read_lock() must see the new value of ->srcu_idx, thus
incrementing the other bank of counters. This eliminates the possibility
of destructive counter overflow as long as the srcu_read_lock() nesting
level does not exceed floor(ULONG_MAX/NR_CPUS/2), which should be an
eminently reasonable nesting limit, especially on 64-bit systems.

Reported-by: Lance Roy <ldr709@gmail.com>
Suggested-by: Lance Roy <ldr709@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# cdf7abc4 31-May-2017 Paolo Bonzini <pbonzini@redhat.com>

srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context

Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
down a guest running iperf on a VFIO assigned device. This happens
because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
context, while a worker thread does the same inside kvm_set_irq(). If the
interrupt happens while the worker thread is executing __srcu_read_lock(),
updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
->srcu_lock_count[] field can be lost.

The docs say you are not supposed to call srcu_read_lock() and
srcu_read_unlock() from irq context, but KVM interrupt injection happens
from (host) interrupt context and it would be nice if SRCU supported the
use case. KVM is using SRCU here not really for the "sleepable" part,
but rather due to its IPI-free fast detection of grace periods. It is
therefore not desirable to switch back to RCU, which would effectively
revert commit 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
2014-01-16).

However, the docs are overly conservative. You can have an SRCU instance
only has users in irq context, and you can mix process and irq context
as long as process context users disable interrupts. In addition,
__srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
Classic SRCU. For those two implementations, only srcu_read_lock()
is unsafe.

When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
in commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
the caller. Tree SRCU however only does one increment, so on most
architectures it is more efficient for __srcu_read_lock() to use
this_cpu_inc(), and any performance differences appear to be down in
the noise.

Unlike Classic and Tree SRCU, Tiny SRCU does increments and decrements on
a single variable. Therefore, as Peter Zijlstra pointed out, Tiny SRCU's
implementation already supports mixed-context use of srcu_read_lock()
and srcu_read_unlock(), at least as long as uses of srcu_read_lock()
and srcu_read_unlock() in each handler are nested and paired properly.
In other words, it is still illegal to (say) invoke srcu_read_lock()
in an interrupt handler and to invoke the matching srcu_read_unlock()
in a softirq handler. Therefore, the only change required for Tiny SRCU
is to its comments.

Fixes: 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
Reported-by: Linu Cherian <linuc.decode@gmail.com>
Suggested-by: Linu Cherian <linuc.decode@gmail.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paolo Bonzini <pbonzini@redhat.com>


# 45753c5f 02-May-2017 Ingo Molnar <mingo@kernel.org>

srcu: Debloat the <linux/rcu_segcblist.h> header

Linus noticed that the <linux/rcu_segcblist.h> has huge inline functions
which should not be inline at all.

As a first step in cleaning this up, move them all to kernel/rcu/ and
only keep an absolute minimum of data type defines in the header:

before: -rw-r--r-- 1 mingo mingo 22284 May 2 10:25 include/linux/rcu_segcblist.h
after: -rw-r--r-- 1 mingo mingo 3180 May 2 10:22 include/linux/rcu_segcblist.h

More can be done, such as uninlining the large functions, which inlining
is unjustified even if it's an RCU internal matter.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b5fe223a 27-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Adjust default auto-expediting holdoff

The default value for the kernel boot parameter srcutree.exp_holdoff
is 50 microseconds, which is too long for good Tree SRCU performance
(compared to Classic SRCU) on the workloads tested by Mike Galbraith.
This commit therefore sets the default value to 25 microseconds, which
shows excellent results in Mike's testing.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>


# 22607d66 25-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Specify auto-expedite holdoff time

On small systems, in the absence of readers, expedited SRCU grace
periods can complete in less than a microsecond. This means that an
eight-CPU system can have all CPUs doing synchronize_srcu() in a tight
loop and almost always expedite. This might actually be desirable in
some situations, but in general it is a good way to needlessly burn
CPU cycles. And in those situations where it is desirable, your friend
is the function synchronize_srcu_expedited().

For other situations, this commit adds a kernel parameter that specifies
a holdoff between completing the last SRCU grace period and auto-expediting
the next. If the next grace period starts before the holdoff expires,
auto-expediting is disabled. The holdoff is 50 microseconds by default,
and can be tuned to the desired number of nanoseconds. A value of zero
disables auto-expediting.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>


# 2da4b2a7 25-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Expedite first synchronize_srcu() when idle

Classic SRCU in effect expedites the first synchronize_srcu() when SRCU
is idle, and Mike Galbraith demonstrated that some use cases do in fact
rely on this behavior. In particular, Mike showed that Steven Rostedt's
hotplug stress script takes 55 seconds with Classic SRCU and more than
16 -minutes- when running Tree SRCU. Assuming that each Tree SRCU's call
to synchronize_srcu() takes four milliseconds, this implies that Steven's
test invokes synchronize_srcu() in isolation, but more than once per
200 microseconds. Mike used ftrace to demonstrate that the time between
successive calls to synchronize_srcu() ranged from 118 to 342 microseconds,
with one outlier at 80 milliseconds. This data clearly indicates that
Tree SRCU needs to expedite the first invocation of synchronize_srcu()
during an SRCU idle period.

This commit therefor introduces a srcu_might_be_idle() function that
probabilistically checks whether or not SRCU is idle. This function is
used by synchronize_rcu() as an additional criterion in deciding whether
or not to expedite.

(Hat trick to Peter Zijlstra for his earlier suggestion that this might
in fact be a problem. Which for all I know might have motivated Mike to
look into it.)

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>


# 1e9a038b 24-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Expedited grace periods with reduced memory contention

Commit f60d231a87c5 ("srcu: Crude control of expedited grace periods")
introduced a per-srcu_struct atomic counter to track outstanding
requests for grace periods. This works, but represents a memory-contention
bottleneck. This commit therefore uses the srcu_node combining tree
to remove this bottleneck.

This commit adds new ->srcu_gp_seq_needed_exp fields to the
srcu_data, srcu_node, and srcu_struct structures, which track the
farthest-in-the-future grace period that must be expedited, which in
turn requires that all nearer-term grace periods also be expedited.
Requests for expediting start with the srcu_data structure, run up
through the srcu_node tree, and end at the srcu_struct structure.
Note that it may be necessary to expedite a grace period that just
now started, and this is handled by a new srcu_funnel_exp_start()
function, which is invoked when the grace period itself is already
in its way, but when that grace period was not marked as expedited.

A new srcu_get_delay() function returns zero if there is at least one
expedited SRCU grace period in flight, or SRCU_INTERVAL otherwise.
This function is used to calculate delays: Normal grace periods
are allowed to extend in order to cover more requests with a given
grace-period computation, which decreases per-request overhead.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>


# 7f6733c3 18-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Make rcutorture writer stalls print SRCU GP state

In the past, SRCU was simple enough that there was little point in
making the rcutorture writer stall messages print the SRCU grace-period
number state. With the advent of Tree SRCU, this has changed. This
commit therefore makes Classic, Tiny, and Tree SRCU report this state
to rcutorture as needed.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>


# c7e88067 18-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Exact tracking of srcu_data structures containing callbacks

The current Tree SRCU implementation schedules a workqueue for every
srcu_data covered by a given leaf srcu_node structure having callbacks,
even if only one of those srcu_data structures actually contains
callbacks. This is clearly inefficient for workloads that don't feature
callbacks everywhere all the time. This commit therefore adds an array
of masks that are used by the leaf srcu_node structures to track exactly
which srcu_data structures contain callbacks.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>


# 0497b489 18-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Expedite srcu_schedule_cbs_snp() callback invocation

Although Tree SRCU does reduce delays when there is at least one
synchronize_srcu_expedited() invocation pending, srcu_schedule_cbs_snp()
still waits for SRCU_INTERVAL before invoking callbacks. Since
synchronize_srcu_expedited() now posts a callback and waits for
that callback to do a wakeup, this destroys the expedited nature of
synchronize_srcu_expedited(). This destruction became apparent to
Marc Zyngier in the guise of a guest-OS bootup slowdown from five
seconds to no fewer than forty seconds.

This commit therefore invokes callbacks immediately at the end of the
grace period when there is at least one synchronize_srcu_expedited()
invocation pending. This brought Marc's guest-OS bootup times back
into the realm of reason.

Reported-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>


# da915ad5 05-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Parallelize callback handling

Peter Zijlstra proposed using SRCU to reduce mmap_sem contention [1,2],
however, there are workloads that could result in a high volume of
concurrent invocations of call_srcu(), which with current SRCU would
result in excessive lock contention on the srcu_struct structure's
->queue_lock, which protects SRCU's callback lists. This commit therefore
moves SRCU to per-CPU callback lists, thus greatly reducing contention.

Because a given SRCU instance no longer has a single centralized callback
list, starting grace periods and invoking callbacks are both more complex
than in the single-list Classic SRCU implementation. Starting grace
periods and handling callbacks are now handled using an srcu_node tree
that is in some ways similar to the rcu_node trees used by RCU-bh,
RCU-preempt, and RCU-sched (for example, the srcu_node tree shape is
controlled by exactly the same Kconfig options and boot parameters that
control the shape of the rcu_node tree).

In addition, the old per-CPU srcu_array structure is now named srcu_data
and contains an rcu_segcblist structure named ->srcu_cblist for its
callbacks (and a spinlock to protect this). The srcu_struct gets
an srcu_gp_seq that is used to associate callback segments with the
corresponding completion-time grace-period number. These completion-time
grace-period numbers are propagated up the srcu_node tree so that the
grace-period workqueue handler can determine whether additional grace
periods are needed on the one hand and where to look for callbacks that
are ready to be invoked.

The srcu_barrier() function must now wait on all instances of the per-CPU
->srcu_cblist. Because each ->srcu_cblist is protected by ->lock,
srcu_barrier() can remotely add the needed callbacks. In theory,
it could also remotely start grace periods, but in practice doing so
is complex and racy. And interestingly enough, it is never necessary
for srcu_barrier() to start a grace period because srcu_barrier() only
enqueues a callback when a callback is already present--and it turns out
that a grace period has to have already been started for this pre-existing
callback. Furthermore, it is only the callback that srcu_barrier()
needs to wait on, not any particular grace period. Therefore, a new
rcu_segcblist_entrain() function enqueues the srcu_barrier() function's
callback into the same segment occupied by the last pre-existing callback
in the list. The special case where all the pre-existing callbacks are
on a different list (because they are in the process of being invoked)
is handled by enqueuing srcu_barrier()'s callback into the RCU_DONE_TAIL
segment, relying on the done-callbacks check that takes place after all
callbacks are inovked.

Note that the readers use the same algorithm as before. Note that there
is a separate srcu_idx that tells the readers what counter to increment.
This unfortunately cannot be combined with srcu_gp_seq because they
need to be incremented at different times.

This commit introduces some ugly #ifdefs in rcutorture. These will go
away when I feel good enough about Tree SRCU to ditch Classic SRCU.

Some crude performance comparisons, courtesy of a quickly hacked rcuperf
asynchronous-grace-period capability:

Callback Queuing Overhead
-------------------------
# CPUS Classic SRCU Tree SRCU
------ ------------ ---------
2 0.349 us 0.342 us
16 31.66 us 0.4 us
41 --------- 0.417 us

The times are the 90th percentiles, a statistic that was chosen to reject
the overheads of the occasional srcu_barrier() call needed to avoid OOMing
the test machine. The rcuperf test hangs when running Classic SRCU at 41
CPUs, hence the line of dashes. Despite the hacks to both the rcuperf code
and that statistics, this is a convincing demonstration of Tree SRCU's
performance and scalability advantages.

[1] https://lwn.net/Articles/309030/
[2] https://patchwork.kernel.org/patch/5108281/

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix initialization if synchronize_srcu_expedited() called first. ]


# dad81a20 25-Mar-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Introduce CLASSIC_SRCU Kconfig option

The TREE_SRCU rewrite is large and a bit on the non-simple side, so
this commit helps reduce risk by allowing the old v4.11 SRCU algorithm
to be selected using a new CLASSIC_SRCU Kconfig option that depends
on RCU_EXPERT. The default is to use the new TREE_SRCU and TINY_SRCU
algorithms, in order to help get these the testing that they need.
However, if your users do not require the update-side scalability that
is to be provided by TREE_SRCU, select RCU_EXPERT and then CLASSIC_SRCU
to revert back to the old classic SRCU algorithm.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>