#
67050837 |
|
26-Dec-2023 |
Joel Fernandes (Google) <joel@joelfernandes.org> |
srcu: Improve comments about acceleration leak The comments added in commit 1ef990c4b36b ("srcu: No need to advance/accelerate if no callback enqueued") are a bit confusing. The comments are describing a scenario for code that was moved and is no longer the way it was (snapshot after advancing). Improve the code comments to reflect this and also document why acceleration can never fail. Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
|
#
c21357e4 |
|
03-Oct-2023 |
Frederic Weisbecker <frederic@kernel.org> |
srcu: Explain why callbacks invocations can't run concurrently If an SRCU barrier is queued while callbacks are running and a new callbacks invocator for the same sdp were to run concurrently, the RCU barrier might execute too early. As this requirement is non-obvious, make sure to keep a record. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
|
#
94c55b9e |
|
03-Oct-2023 |
Frederic Weisbecker <frederic@kernel.org> |
srcu: No need to advance/accelerate if no callback enqueued While in grace period start, there is nothing to accelerate and therefore no need to advance the callbacks either if no callback is to be enqueued. Spare these needless operations in this case. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
|
#
20eb4142 |
|
03-Oct-2023 |
Frederic Weisbecker <frederic@kernel.org> |
srcu: Remove superfluous callbacks advancing from srcu_gp_start() Callbacks advancing on SRCU must be performed on two specific places: 1) On enqueue time in order to make room for the acceleration of the new callback. 2) On invocation time in order to move the callbacks ready to invoke. Any other callback advancing callsite is needless. Remove the remaining one in srcu_gp_start(). Co-developed-by: Yong He <zhuangel570@gmail.com> Signed-off-by: Yong He <zhuangel570@gmail.com> Co-developed-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Co-developed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
|
#
8a77f38b |
|
03-Oct-2023 |
Frederic Weisbecker <frederic@kernel.org> |
srcu: Only accelerate on enqueue time Acceleration in SRCU happens on enqueue time for each new callback. This operation is expected not to fail and therefore any similar attempt from other places shouldn't find any remaining callbacks to accelerate. Moreover accelerations performed beyond enqueue time are error prone because rcu_seq_snap() then may return the snapshot for a new grace period that is not going to be started. Remove these dangerous and needless accelerations and introduce instead assertions reporting leaking unaccelerated callbacks beyond enqueue time. Co-developed-by: Yong He <alexyonghe@tencent.com> Signed-off-by: Yong He <alexyonghe@tencent.com> Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Co-developed-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com> Reviewed-by: Like Xu <likexu@tencent.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
|
#
4a8e65b0 |
|
03-Oct-2023 |
Frederic Weisbecker <frederic@kernel.org> |
srcu: Fix callbacks acceleration mishandling SRCU callbacks acceleration might fail if the preceding callbacks advance also fails. This can happen when the following steps are met:
1) The RCU_WAIT_TAIL segment has callbacks (say for gp_num 8) and the RCU_NEXT_READY_TAIL also has callbacks (say for gp_num 12).
2) The grace period for RCU_WAIT_TAIL is observed as started but not yet completed so rcu_seq_current() returns 4 + SRCU_STATE_SCAN1 = 5.
3) This value is passed to rcu_segcblist_advance() which can't move any segment forward and fails.
4) srcu_gp_start_if_needed() still proceeds with callback acceleration. But then the call to rcu_seq_snap() observes the grace period for the RCU_WAIT_TAIL segment (gp_num 8) as completed and the subsequent one for the RCU_NEXT_READY_TAIL segment as started (ie: 8 + SRCU_STATE_SCAN1 = 9) so it returns a snapshot of the next grace period, which is 16.
5) The value of 16 is passed to rcu_segcblist_accelerate() but the freshly enqueued callback in RCU_NEXT_TAIL can't move to RCU_NEXT_READY_TAIL which already has callbacks for a previous grace period (gp_num = 12). So acceleration fails.
6) Note that in all these steps, srcu_invoke_callbacks() hadn't had a chance to run.
Then some very bad outcome may happen if the following happens:
7) Some other CPU races and starts the grace period number 16 before the CPU handling previous steps had a chance. Therefore srcu_gp_start() isn't called on the latter sdp to fix the acceleration leak from previous steps with a new pair of call to advance/accelerate.
8) The grace period 16 completes and srcu_invoke_callbacks() is finally called. All the callbacks from previous grace periods (8 and 12) are correctly advanced and executed but callbacks in RCU_NEXT_READY_TAIL still remain. Then rcu_segcblist_accelerate() is called with a snapshot of 20.
9) Since nothing started the grace period number 20, callbacks stay unhandled.
This has been reported in real load: [3144162.608392] INFO: task kworker/136:12:252684 blocked for more than 122 seconds. [3144162.615986] Tainted: G O K 5.4.203-1-tlinux4-0011.1 #1 [3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [3144162.631162] kworker/136:12 D 0 252684 2 0x90004000 [3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm] [3144162.631192] Call Trace: [3144162.631202] __schedule+0x2ee/0x660 [3144162.631206] schedule+0x33/0xa0 [3144162.631209] schedule_timeout+0x1c4/0x340 [3144162.631214] ? update_load_avg+0x82/0x660 [3144162.631217] ? raw_spin_rq_lock_nested+0x1f/0x30 [3144162.631218] wait_for_completion+0x119/0x180 [3144162.631220] ? wake_up_q+0x80/0x80 [3144162.631224] __synchronize_srcu.part.19+0x81/0xb0 [3144162.631226] ? __bpf_trace_rcu_utilization+0x10/0x10 [3144162.631227] synchronize_srcu+0x5f/0xc0 [3144162.631236] irqfd_shutdown+0x3c/0xb0 [kvm] [3144162.631239] ? __schedule+0x2f6/0x660 [3144162.631243] process_one_work+0x19a/0x3a0 [3144162.631244] worker_thread+0x37/0x3a0 [3144162.631247] kthread+0x117/0x140 [3144162.631247] ? process_one_work+0x3a0/0x3a0 [3144162.631248] ? __kthread_cancel_work+0x40/0x40 [3144162.631250] ret_from_fork+0x1f/0x30 Fix this with taking the snapshot for acceleration _before_ the read of the current grace period number. The only side effect of this solution is that callbacks advancing happen then _after_ the full barrier in rcu_seq_snap(). This is not a problem because that barrier only cares about: 1) Ordering accesses of the update side before call_srcu() so they don't bleed. 2) See all the accesses prior to the grace period of the current gp_num The only things callbacks advancing need to be ordered against are carried by snp locking. 
Reported-by: Yong He <alexyonghe@tencent.com> Co-developed-by: Yong He <alexyonghe@tencent.com> Signed-off-by: Yong He <alexyonghe@tencent.com> Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Co-developed-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com> Link: http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com Fixes: da915ad5cf25 ("srcu: Parallelize callback handling") Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
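
The sequence numbers quoted in the steps of this commit message can be reproduced with a small userspace model of the snapshot arithmetic. This is only a sketch: it assumes the two low-order bits of the sequence counter hold the grace-period state (which is consistent with the "4 + SRCU_STATE_SCAN1 = 5" figure above), it omits the memory barrier that the real rcu_seq_snap() contains, and every model_/MODEL_ name is a stand-in invented for illustration.

```c
#include <assert.h>

/* Assumed layout: the low two bits carry the state, the rest the counter. */
#define MODEL_SEQ_CTR_SHIFT	2
#define MODEL_SEQ_STATE_MASK	((1UL << MODEL_SEQ_CTR_SHIFT) - 1)

/*
 * Model of the snapshot computation: the sequence value that will be
 * reached once a full grace period has elapsed after "seq" was observed.
 */
static unsigned long model_seq_snap(unsigned long seq)
{
	return (seq + 2 * MODEL_SEQ_STATE_MASK + 1) & ~MODEL_SEQ_STATE_MASK;
}

/* Replays the numbers used in the commit message. */
static void model_check_commit_numbers(void)
{
	/* Step 2: grace period in progress, current value is 4 + 1 = 5. */
	assert(model_seq_snap(5) == 12);
	/* Step 4: a later read sees 8 + 1 = 9, so the snapshot jumps to 16. */
	assert(model_seq_snap(9) == 16);
	/* Step 8: with grace period 16 in progress (12 + 1), the snapshot is 20. */
	assert(model_seq_snap(13) == 20);
}
```

In this model, a snapshot taken while the counter still reads 5 yields 12, the grace period already associated with RCU_NEXT_READY_TAIL in the example. That matches the fix, which takes the snapshot before reading the current grace-period number.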
|
#
d8d5b7bf |
|
04-Sep-2023 |
Denis Arefev <arefev@swemel.ru> |
srcu: Fix srcu_struct node grpmask overflow on 64-bit systems The value of a bitwise expression 1 << (cpu - sdp->mynode->grplo) is subject to overflow due to a failure to cast operands to a larger data type before performing the bitwise operation. The maximum result of this subtraction is defined by the RCU_FANOUT_LEAF Kconfig option, which on 64-bit systems defaults to 16 (resulting in a maximum shift of 15), but which can be set up as high as 64 (resulting in a maximum shift of 63). A value of 31 can result in sign extension, resulting in 0xffffffff80000000 instead of the desired 0x80000000. A value of 32 or greater triggers undefined behavior per the C standard. This bug has not been known to cause issues because almost all kernels take the default CONFIG_RCU_FANOUT_LEAF=16. Furthermore, as long as a given compiler gives a deterministic non-zero result for 1<<N for N>=32, the code correctly invokes all SRCU callbacks, albeit wasting CPU time along the way. This commit therefore substitutes the correct 1UL for the buggy 1. Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Denis Arefev <arefev@swemel.ru> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: David Laight <David.Laight@aculab.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
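
A minimal userspace sketch of the corrected expression follows. The helper name leaf_cpu_mask() and its parameters are hypothetical, chosen only to mirror the "cpu - grplo" subtraction described above. The point is that the constant being shifted has type unsigned long, so the shift is not carried out in int.

```c
#include <assert.h>
#include <limits.h>

/*
 * Returns the bit for "cpu" within a leaf whose first CPU is "grplo".
 * Hypothetical helper: the kernel open-codes this expression.
 */
static unsigned long leaf_cpu_mask(int cpu, int grplo)
{
	int shift = cpu - grplo;

	/* A shift count at or beyond the type width is undefined behavior. */
	assert(shift >= 0 &&
	       (unsigned int)shift < sizeof(unsigned long) * CHAR_BIT);

	/* 1UL, not 1: the shift is performed in unsigned long. */
	return 1UL << shift;
}
```

With this form a shift of 31 yields 0x80000000 rather than a sign-extended value, and shifts up to 63 are well defined on systems where unsigned long is 64 bits wide.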
|
#
2cbc482d |
|
04-Aug-2023 |
Zhen Lei <thunder.leizhen@huawei.com> |
rcu: Dump memory object info if callback function is invalid When a structure containing an RCU callback rhp is (incorrectly) freed and reallocated after rhp is passed to call_rcu(), it is not unusual for rhp->func to be set to NULL. This defeats the debugging prints used by __call_rcu_common() in kernels built with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, which expect to identify the offending code using the identity of this function. And in kernels built without CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, things are even worse, as can be seen from this splat: Unable to handle kernel NULL pointer dereference at virtual address 0 ... ... PC is at 0x0 LR is at rcu_do_batch+0x1c0/0x3b8 ... ... (rcu_do_batch) from (rcu_core+0x1d4/0x284) (rcu_core) from (__do_softirq+0x24c/0x344) (__do_softirq) from (__irq_exit_rcu+0x64/0x108) (__irq_exit_rcu) from (irq_exit+0x8/0x10) (irq_exit) from (__handle_domain_irq+0x74/0x9c) (__handle_domain_irq) from (gic_handle_irq+0x8c/0x98) (gic_handle_irq) from (__irq_svc+0x5c/0x94) (__irq_svc) from (arch_cpu_idle+0x20/0x3c) (arch_cpu_idle) from (default_idle_call+0x4c/0x78) (default_idle_call) from (do_idle+0xf8/0x150) (do_idle) from (cpu_startup_entry+0x18/0x20) (cpu_startup_entry) from (0xc01530) This commit therefore adds calls to mem_dump_obj(rhp) to output some information, for example: slab kmalloc-256 start ffff410c45019900 pointer offset 0 size 256 This provides the rough size of the memory block and the offset of the rcu_head structure, which at least provides a few clues to help locate the problem. If the problem is reproducible, additional slab debugging can be enabled, for example, CONFIG_DEBUG_SLAB=y, which can provide significantly more information. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
|
#
f0a31b26 |
|
29-Jul-2023 |
Joel Fernandes (Google) <joel@joelfernandes.org> |
srcu: Fix error handling in init_srcu_struct_fields() The current error handling in init_srcu_struct_fields() is a bit inconsistent. If init_srcu_struct_nodes() fails, the function either returns -ENOMEM or 0 depending on whether ssp->sda_is_static is true or false. This can make init_srcu_struct_fields() return 0 even if memory allocation failed! Simplify the error handling by always returning -ENOMEM if either init_srcu_struct_nodes() or the per-CPU allocation fails. This makes the control flow easier to follow and avoids the inconsistent return values. Add goto labels to avoid duplicating the error cleanup code. Link: https://lore.kernel.org/r/20230404003508.GA254019@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
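
The goto-label pattern described in this commit message can be illustrated with a small userspace sketch. Everything here is hypothetical: struct demo_ctx, demo_alloc() and demo_init() are stand-ins invented for the example and are not the kernel's init_srcu_struct_fields(). The fail flags exist only so that an allocation failure can be injected.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct demo_ctx {
	void *nodes;	/* stands in for the node allocation */
	void *percpu;	/* stands in for the per-CPU allocation */
};

/* Hypothetical allocator wrapper that can be told to fail. */
static void *demo_alloc(size_t size, bool fail)
{
	return fail ? NULL : malloc(size);
}

/* Returns 0 on success and -ENOMEM if either allocation fails. */
static int demo_init(struct demo_ctx *ctx, bool fail_nodes, bool fail_percpu)
{
	assert(ctx != NULL);
	ctx->nodes = NULL;
	ctx->percpu = NULL;

	ctx->nodes = demo_alloc(64, fail_nodes);
	if (!ctx->nodes)
		goto err;

	ctx->percpu = demo_alloc(64, fail_percpu);
	if (!ctx->percpu)
		goto err_free_nodes;

	return 0;

err_free_nodes:
	/* Undo only what was set up before the failing step. */
	free(ctx->nodes);
	ctx->nodes = NULL;
err:
	return -ENOMEM;
}

/* Releases both allocations after a successful demo_init(). */
static void demo_cleanup(struct demo_ctx *ctx)
{
	free(ctx->percpu);
	free(ctx->nodes);
	ctx->percpu = NULL;
	ctx->nodes = NULL;
}
```

Because every failure path funnels through the labels, there is a single place where the error code is chosen, which is what rules out returning 0 after a failed allocation.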
|
#
754aa642 |
|
27-Jan-2023 |
Joel Fernandes (Google) <joel@joelfernandes.org> |
srcu: Clarify comments on memory barrier "E" There is an smp_mb() named "E" in srcu_flip() immediately before the increment (flip) of the srcu_struct structure's ->srcu_idx. The purpose of E is to order the preceding scan's read of lock counters against the flipping of the ->srcu_idx, in order to prevent new readers from continuing to use the old ->srcu_idx value, which might needlessly extend the grace period. However, this ordering is already enforced because of the control dependency between the preceding scan and the ->srcu_idx flip. This control dependency exists because atomic_long_read() is used to scan the counts, because WRITE_ONCE() is used to flip ->srcu_idx, and because ->srcu_idx is not flipped until the ->srcu_lock_count[] and ->srcu_unlock_count[] counts match. And such a match cannot happen when there is an in-flight reader that started before the flip (observation courtesy Mathieu Desnoyers). The litmus test below (courtesy of Frederic Weisbecker, with changes for ctrldep by Boqun and Joel) shows this:

    C srcu

    (*
     * bad condition: P0's first scan (SCAN1) saw P1's idx=0 LOCK count inc, though P1 saw flip.
     *
     * So basically, the ->po ordering on both P0 and P1 is enforced via ->ppo
     * (control deps) on both sides, and both P0 and P1 are interconnected by ->rf
     * relations. Combining the ->ppo with ->rf, a cycle is impossible.
     *)

    {}

    // updater
    P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
    {
        int lock1;
        int unlock1;
        int lock0;
        int unlock0;

        // SCAN1
        unlock1 = READ_ONCE(*UNLOCK1);
        smp_mb(); // A
        lock1 = READ_ONCE(*LOCK1);

        // FLIP
        if (lock1 == unlock1) { // Control dep
            smp_mb(); // E // Remove E and still passes.
            WRITE_ONCE(*IDX, 1);
            smp_mb(); // D

            // SCAN2
            unlock0 = READ_ONCE(*UNLOCK0);
            smp_mb(); // A
            lock0 = READ_ONCE(*LOCK0);
        }
    }

    // reader
    P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
    {
        int tmp;
        int idx1;
        int idx2;

        // 1st reader
        idx1 = READ_ONCE(*IDX);
        if (idx1 == 0) { // Control dep
            tmp = READ_ONCE(*LOCK0);
            WRITE_ONCE(*LOCK0, tmp + 1);
            smp_mb(); /* B and C */
            tmp = READ_ONCE(*UNLOCK0);
            WRITE_ONCE(*UNLOCK0, tmp + 1);
        } else {
            tmp = READ_ONCE(*LOCK1);
            WRITE_ONCE(*LOCK1, tmp + 1);
            smp_mb(); /* B and C */
            tmp = READ_ONCE(*UNLOCK1);
            WRITE_ONCE(*UNLOCK1, tmp + 1);
        }
    }

    exists (0:lock1=1 /\ 1:idx1=1)

More complicated litmus tests with multiple SRCU readers also show that memory barrier E is not needed. This commit therefore clarifies the comment on memory barrier E. Why not also remove that redundant smp_mb()? Because control dependencies are quite fragile due to their not being recognized by most compilers and tools. Control dependencies therefore exact an ongoing maintenance burden, and such a burden cannot be justified in this slowpath. Therefore, that smp_mb() stays until such time as its overhead becomes a measurable problem in a real workload running on a real production system, or until such time as compilers start paying attention to this sort of control dependency. Co-developed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Co-developed-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
#
cefc0a59 |
|
18-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Fix long lines in srcu_funnel_gp_start() This commit creates an srcu_usage pointer named "sup" as a shorter synonym for the "ssp->srcu_sup" that was bloating several lines of code. Cc: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
6c366522 |
|
18-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Fix long lines in srcu_gp_end() This commit creates an srcu_usage pointer named "sup" as a shorter synonym for the "ssp->srcu_sup" that was bloating several lines of code. Cc: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
5ff8319f |
|
18-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Fix long lines in cleanup_srcu_struct() This commit creates an srcu_usage pointer named "sup" as a shorter synonym for the "ssp->srcu_sup" that was bloating several lines of code. Cc: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
eabe7625 |
|
18-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Fix long lines in srcu_get_delay() This commit creates an srcu_usage pointer named "sup" as a shorter synonym for the "ssp->srcu_sup" that was bloating several lines of code. Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Cc: Christoph Hellwig <hch@lst.de> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
a7bf4d7c |
|
24-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Check for readers at module-exit time If a given statically allocated in-module srcu_struct structure was ever used for updates, srcu_module_going() will invoke cleanup_srcu_struct() at module-exit time. This will check for the error case of SRCU readers persisting past module-exit time. On the other hand, if this srcu_struct structure never went through a grace period, srcu_module_going() only invokes free_percpu(), which would result in strange failures if SRCU readers persisted past module-exit time. This commit therefore adds a srcu_readers_active() check to srcu_module_going(), splatting if readers have persisted and refraining from invoking free_percpu() in that case. Better to leak memory than to suffer silent memory corruption! [ paulmck: Apply Zhang, Qiang1 feedback on memory leak. ] Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
fd1b3f8e |
|
17-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Move work-scheduling fields from srcu_struct to srcu_usage This commit moves the ->reschedule_jiffies, ->reschedule_count, and ->work fields from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. However, this means that the container_of() calls cannot get a pointer to the srcu_struct because they are no longer in the srcu_struct. This issue is addressed by adding a ->srcu_ssp field in the srcu_usage structure that references the corresponding srcu_struct structure. And given the presence of the sup pointer to the srcu_usage structure, replace some ssp->srcu_usage-> instances with sup->. [ paulmck Apply feedback from kernel test robot. ] Link: https://lore.kernel.org/oe-kbuild-all/202303191400.iO5BOqka-lkp@intel.com/ Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
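
The container_of() issue and the ->srcu_ssp back-pointer described above can be sketched in plain C. The structures below are heavily simplified stand-ins with a demo_ prefix (the real work field is a kernel work item, not an int-bearing struct), and demo_container_of() is a userspace approximation of the kernel macro.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace approximation of the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_work {
	int pending;
};

struct demo_srcu_struct;

/* Update-side state: now owns the work item and a back-pointer. */
struct demo_srcu_usage {
	struct demo_work work;
	struct demo_srcu_struct *srcu_ssp;
};

/* Reader-visible state: kept small, points at the update-side state. */
struct demo_srcu_struct {
	unsigned int srcu_idx;
	struct demo_srcu_usage *srcu_sup;
};

/*
 * A work handler only receives the work pointer. container_of() can
 * recover the enclosing demo_srcu_usage, but not the demo_srcu_struct,
 * so the back-pointer supplies the final hop.
 */
static struct demo_srcu_struct *demo_work_to_ssp(struct demo_work *wp)
{
	struct demo_srcu_usage *sup;

	sup = demo_container_of(wp, struct demo_srcu_usage, work);
	assert(sup->srcu_ssp != NULL);
	return sup->srcu_ssp;
}
```

The cost of the design is one extra pointer in the update-side structure. In exchange, the reader-side structure no longer has to embed the work item.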
|
#
d20162e0 |
|
17-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Move srcu_barrier() fields from srcu_struct to srcu_usage This commit moves the ->srcu_barrier_seq, ->srcu_barrier_mutex, ->srcu_barrier_completion, and ->srcu_barrier_cpu_cnt fields from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
660349ac |
|
17-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Move ->sda_is_static from srcu_struct to srcu_usage This commit moves the ->sda_is_static field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
3b46679c |
|
17-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Move heuristics fields from srcu_struct to srcu_usage This commit moves the ->srcu_size_jiffies, ->srcu_n_lock_retries, and ->srcu_n_exp_nodelay fields from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
03200b5c |
|
17-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Move grace-period fields from srcu_struct to srcu_usage This commit moves the ->srcu_gp_seq, ->srcu_gp_seq_needed, ->srcu_gp_seq_needed_exp, ->srcu_gp_start, and ->srcu_last_gp_end fields from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
e3a6ab25 |
|
17-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Move ->srcu_gp_mutex from srcu_struct to srcu_usage This commit moves the ->srcu_gp_mutex field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
b3fb11f7 |
|
17-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Move ->lock from srcu_struct to srcu_usage This commit moves the ->lock field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
0839ade9 |
|
17-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Move ->lock initialization after srcu_usage allocation Currently, both __init_srcu_struct() in CONFIG_DEBUG_LOCK_ALLOC=y kernels and init_srcu_struct() in CONFIG_DEBUG_LOCK_ALLOC=n kernels initialize the srcu_struct structure's ->lock before the srcu_usage structure has been allocated. This of course prevents the ->lock from being moved to the srcu_usage structure, so this commit moves the initialization into init_srcu_struct_fields() after the srcu_usage structure has been allocated. Cc: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
574dc1a7 |
|
17-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Move ->srcu_cb_mutex from srcu_struct to srcu_usage This commit moves the ->srcu_cb_mutex field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
a0d8cbd3 |
|
17-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Move ->srcu_size_state from srcu_struct to srcu_usage This commit moves the ->srcu_size_state field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
208f41b1 |
|
17-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Move ->level from srcu_struct to srcu_usage This commit moves the ->level[] array from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
95433f72 |
|
16-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Begin offloading srcu_struct fields to srcu_update The current srcu_struct structure is on the order of 200 bytes in size (depending on architecture and .config), which is much better than the old-style 26K bytes, but still all too inconvenient when one is trying to achieve good cache locality on a fastpath involving SRCU readers. However, only a few fields in srcu_struct are used by SRCU readers. The remaining fields could be offloaded to a new srcu_update structure, thus shrinking the srcu_struct structure down to a few tens of bytes. This commit begins this noble quest, a quest that is complicated by open-coded initialization of the srcu_struct within the srcu_notifier_head structure. This complication is addressed by updating the srcu_notifier_head structure's open coding, given that there does not appear to be a straightforward way of abstracting that initialization. This commit moves only the ->node pointer to srcu_update. Later commits will move additional fields. [ paulmck: Fold in qiang1.zhang@intel.com's memory-leak fix. ] Link: https://lore.kernel.org/all/20230320055751.4120251-1-qiang1.zhang@intel.com/ Suggested-by: Christoph Hellwig <hch@lst.de> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl> Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
f4d01a25 |
|
17-Mar-2023 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Use static init for statically allocated in-module srcu_struct Further shrinking the srcu_struct structure is eased by requiring that in-module srcu_struct structures rely more heavily on static initialization. In particular, this preserves the property that a module-load-time srcu_struct initialization can fail only due to memory-allocation failure of the per-CPU srcu_data structures. It might also slightly improve robustness by keeping the number of memory allocations that must succeed down percpu_alloc() call. This is in preparation for splitting an srcu_usage structure out of the srcu_struct structure. [ paulmck: Fold in qiang1.zhang@intel.com feedback. ] Cc: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
f0f44752 |
|
12-Jan-2023 |
Boqun Feng <boqun.feng@gmail.com> |
rcu: Annotate SRCU's update-side lockdep dependencies Although all flavors of RCU readers are annotated correctly with lockdep as recursive read locks, they do not set the lock_acquire 'check' parameter. This means that RCU read locks are not added to the lockdep dependency graph, which in turn means that lockdep cannot detect RCU-based deadlocks. This is not a problem for RCU flavors having atomic read-side critical sections because context-based annotations can catch these deadlocks, see for example the RCU_LOCKDEP_WARN() statement in synchronize_rcu(). But context-based annotations are not helpful for sleepable RCU, especially given that it is perfectly legal to do synchronize_srcu(&srcu1) within an srcu_read_lock(&srcu2). However, we can detect SRCU-based deadlocks by: (1) Making srcu_read_lock() a 'check'ed recursive read lock and (2) Making synchronize_srcu() an empty write lock critical section. Even better, with the newly introduced lock_sync(), we can avoid false positives about irq-unsafe/safe. This commit therefore makes it so. Note that NMI-safe SRCU read side critical sections are currently not annotated, but might be annotated in the future. Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> [ boqun: Add comments for annotation per Waiman's suggestion ] [ boqun: Fix comment warning reported by Stephen Rothwell ] Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
|
#
dafc4d16 |
|
21-Dec-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Update comment after the index flip Because there is not guaranteed to be a full memory barrier between the ->srcu_unlock_count increment of an srcu_read_unlock() and the ->srcu_lock_count increment of the next srcu_read_lock(), this next srcu_read_lock() is not guaranteed to see the effect of the index flip just prior to this comment. However, this next srcu_read_lock() will execute a full memory barrier, so the srcu_read_lock() after that is guaranteed to see that index flip. This guarantee is illustrated by the following diagram of events and the litmus test following that. ------------------------------------------------------------------------ READER UPDATER ------------- ---------- // idx is initially 0. srcu_flip() { smp_mb(); // RSCS srcu_read_unlock() { smp_mb(); idx++; // P smp_mb(); // QQ } srcu_readers_unlock_idx(0) { ,--counted------------ count all unlock[0]; // Q | unlock[0]++; // X } smp_mb(); srcu_read_lock() { READ(idx) = 0; ,---- count all lock[0]; // contributes imbalance of 1. lock[0]++; ----counted | smp_mb(); // PP } | } | | // RSCS not going to effect above scan | srcu_read_unlock() { | smp_mb(); | unlock[0]++; | } | / / srcu_read_lock() { | READ(idx); // Y -----cannot be counted because of P (has to sample idx as 1) lock[1]++; ... } ------------------------------------------------------------------------ This makes it similar to the store buffer pattern. Using X, Y, P and Q annotated above, we get: ------------------------------------------------------------------------ READER UPDATER X (write) P (write) smp_mb(); //PP smp_mb(); //QQ Y (read) Q (read) ------------------------------------------------------------------------ ASCII art courtesy of Joel Fernandes. Reported-by: Joel Fernandes <joel@joelfernandes.org> Reported-by: Boqun Feng <boqun.feng@gmail.com> Reported-by: Frederic Weisbecker <frederic@kernel.org> Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
0cd4b50b |
|
14-Dec-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Yet more detail for srcu_readers_active_idx_check() comments The comment in srcu_readers_active_idx_check() following the smp_mb() is out of date, hailing from a simpler time when preemption was disabled across the bulk of __srcu_read_lock(). The fact that preemption was disabled meant that the number of tasks that had fetched the old index but not yet incremented counters was limited by the number of CPUs. In our more complex modern times, the number of CPUs is no longer a limit. This commit therefore updates this comment, additionally giving more memory-ordering detail. [ paulmck: Apply Nt->Nc feedback from Joel Fernandes. ] Reported-by: Boqun Feng <boqun.feng@gmail.com> Reported-by: Frederic Weisbecker <frederic@kernel.org> Reported-by: "Joel Fernandes (Google)" <joel@joelfernandes.org> Reported-by: Neeraj Upadhyay <neeraj.iitr10@gmail.com> Reported-by: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
1bafbfb3 |
|
23-Nov-2022 |
Pingfan Liu <kernelfans@gmail.com> |
srcu: Remove needless rcu_seq_done() check while holding read lock The srcu_gp_start_if_needed() function now read-holds the srcu_struct whose grace period is being started, which means that the corresponding SRCU grace period cannot end. This in turn means that the SRCU grace-period sequence number returned by rcu_seq_snap() cannot expire during this time. And that means that the calls to rcu_seq_done() in srcu_funnel_exp_start() and srcu_funnel_gp_start() can never return true. This commit therefore removes these rcu_seq_done() checks, but adds checks in kernels built with CONFIG_PROVE_RCU=y that splat if rcu_seq_done() does somehow return true. [ paulmck: Rearrange checks to handle kernels built with lockdep. ] Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> To: rcu@vger.kernel.org Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
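
The argument above depends on how a grace-period sequence number is snapshotted and later compared. The following is a minimal userspace sketch, assuming a two-bit state field in the low-order bits. The names seq_snap() and seq_done() are hypothetical stand-ins modeled on rcu_seq_snap() and rcu_seq_done(); this is an illustration, not the kernel source.

```c
#include <assert.h>
#include <limits.h>

#define SEQ_CTR_SHIFT  2                              /* assumed: two state bits */
#define SEQ_STATE_MASK ((1UL << SEQ_CTR_SHIFT) - 1)

/* Wrap-safe "a >= b" for free-running unsigned counters. */
static int ulong_cmp_ge(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 >= a - b;
}

/*
 * Smallest sequence value whose arrival means that a full grace
 * period has elapsed since the snapshot was taken.
 */
static unsigned long seq_snap(unsigned long cur)
{
	return (cur + 2 * SEQ_STATE_MASK + 1) & ~SEQ_STATE_MASK;
}

/* Has the current sequence value reached the snapshot? */
static int seq_done(unsigned long cur, unsigned long snap)
{
	return ulong_cmp_ge(cur, snap);
}
```

With a grace period already in progress (value 1), the snapshot is 8, so seq_done() stays false until the following grace period has also completed. A reader that prevents the grace period from ending therefore also prevents its own snapshot from expiring.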
|
#
50be0c04 |
|
15-Nov-2022 |
Pingfan Liu <kernelfans@gmail.com> |
srcu: Fix the comparison in srcu_invl_snp_seq() A grace-period sequence number contains two fields: counter and state. SRCU_SNP_INIT_SEQ provides a guaranteed invalid value for grace-period sequence numbers in newly allocated srcu_node structures' ->srcu_have_cbs[] and ->srcu_gp_seq_needed_exp fields. The point of the comparison in srcu_invl_snp_seq() is not to detect invalid grace-period sequence numbers in general, but rather to detect a newly allocated srcu_node structure whose ->srcu_have_cbs[] and ->srcu_gp_seq_needed_exp fields need to be brought into line with the srcu_struct structure's ->srcu_gp_seq field. This commit therefore causes srcu_invl_snp_seq() to compare both fields of the specified grace-period sequence number. Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: <rcu@vger.kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
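
The difference between comparing one field and comparing both can be sketched in userspace as follows. This assumes a two-bit state field, an initialization marker of 0x2 (counter 0, state 2), and that the earlier check looked at the state field alone. All names here are hypothetical, chosen for illustration only.

```c
#include <assert.h>

#define SEQ_CTR_SHIFT  2                       /* assumed: two state bits */
#define SEQ_STATE_MASK ((1UL << SEQ_CTR_SHIFT) - 1)
#define SNP_INIT_SEQ   0x2UL                   /* counter 0, state 2 */

static unsigned long seq_ctr(unsigned long s)
{
	return s >> SEQ_CTR_SHIFT;
}

static unsigned long seq_state(unsigned long s)
{
	return s & SEQ_STATE_MASK;
}

/* State-only test: also matches values with a nonzero counter and state 2. */
static int invl_state_only(unsigned long s)
{
	return seq_state(s) == SNP_INIT_SEQ;
}

/* Both-fields test: matches only the initialization marker itself. */
static int invl_both_fields(unsigned long s)
{
	return s == SNP_INIT_SEQ;
}
```

The value 0x6 (counter 1, state 2) shows the difference: the state-only test reports it as freshly initialized, while the both-fields test does not.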
|
#
7f24626d |
|
30-Oct-2022 |
Pingfan Liu <kernelfans@gmail.com> |
srcu: Delegate work to the boot cpu if using SRCU_SIZE_SMALL

Commit 994f706872e6 ("srcu: Make Tree SRCU able to operate without snp_node array") assumes that cpu 0 is always online. However, there really are situations when some other CPU is the boot CPU, for example, when booting a kdump kernel with the maxcpus=1 boot parameter. On PowerPC, the kdump kernel can hang as follows:

...
[ 1.740036] systemd[1]: Hostname set to <xyz.com>
[ 243.686240] INFO: task systemd:1 blocked for more than 122 seconds.
[ 243.686264] Not tainted 6.1.0-rc1 #1
[ 243.686272] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 243.686281] task:systemd state:D stack:0 pid:1 ppid:0 flags:0x00042000
[ 243.686296] Call Trace:
[ 243.686301] [c000000016657640] [c000000016657670] 0xc000000016657670 (unreliable)
[ 243.686317] [c000000016657830] [c00000001001dec0] __switch_to+0x130/0x220
[ 243.686333] [c000000016657890] [c000000010f607b8] __schedule+0x1f8/0x580
[ 243.686347] [c000000016657940] [c000000010f60bb4] schedule+0x74/0x140
[ 243.686361] [c0000000166579b0] [c000000010f699b8] schedule_timeout+0x168/0x1c0
[ 243.686374] [c000000016657a80] [c000000010f61de8] __wait_for_common+0x148/0x360
[ 243.686387] [c000000016657b20] [c000000010176bb0] __flush_work.isra.0+0x1c0/0x3d0
[ 243.686401] [c000000016657bb0] [c0000000105f2768] fsnotify_wait_marks_destroyed+0x28/0x40
[ 243.686415] [c000000016657bd0] [c0000000105f21b8] fsnotify_destroy_group+0x68/0x160
[ 243.686428] [c000000016657c40] [c0000000105f6500] inotify_release+0x30/0xa0
[ 243.686440] [c000000016657cb0] [c0000000105751a8] __fput+0xc8/0x350
[ 243.686452] [c000000016657d00] [c00000001017d524] task_work_run+0xe4/0x170
[ 243.686464] [c000000016657d50] [c000000010020e94] do_notify_resume+0x134/0x140
[ 243.686478] [c000000016657d80] [c00000001002eb18] interrupt_exit_user_prepare_main+0x198/0x270
[ 243.686493] [c000000016657de0] [c00000001002ec60] syscall_exit_prepare+0x70/0x180
[ 243.686505] [c000000016657e10] [c00000001000bf7c] system_call_vectored_common+0xfc/0x280
[ 243.686520] --- interrupt: 3000 at 0x7fffa47d5ba4
[ 243.686528] NIP: 00007fffa47d5ba4 LR: 0000000000000000 CTR: 0000000000000000
[ 243.686538] REGS: c000000016657e80 TRAP: 3000 Not tainted (6.1.0-rc1)
[ 243.686548] MSR: 800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE> CR: 42044440 XER: 00000000
[ 243.686572] IRQMASK: 0
[ 243.686572] GPR00: 0000000000000006 00007ffffa606710 00007fffa48e7200 0000000000000000
[ 243.686572] GPR04: 0000000000000002 000000000000000a 0000000000000000 0000000000000001
[ 243.686572] GPR08: 000001000c172dd0 0000000000000000 0000000000000000 0000000000000000
[ 243.686572] GPR12: 0000000000000000 00007fffa4ff4bc0 0000000000000000 0000000000000000
[ 243.686572] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 243.686572] GPR20: 0000000132dfdc50 000000000000000e 0000000000189375 0000000000000000
[ 243.686572] GPR24: 00007ffffa606ae0 0000000000000005 000001000c185490 000001000c172570
[ 243.686572] GPR28: 000001000c172990 000001000c184850 000001000c172e00 00007fffa4fedd98
[ 243.686683] NIP [00007fffa47d5ba4] 0x7fffa47d5ba4
[ 243.686691] LR [0000000000000000] 0x0
[ 243.686698] --- interrupt: 3000
[ 243.686708] INFO: task kworker/u16:1:24 blocked for more than 122 seconds.
[ 243.686717] Not tainted 6.1.0-rc1 #1
[ 243.686724] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 243.686733] task:kworker/u16:1 state:D stack:0 pid:24 ppid:2 flags:0x00000800
[ 243.686747] Workqueue: events_unbound fsnotify_mark_destroy_workfn
[ 243.686758] Call Trace:
[ 243.686762] [c0000000166736e0] [c00000004fd91000] 0xc00000004fd91000 (unreliable)
[ 243.686775] [c0000000166738d0] [c00000001001dec0] __switch_to+0x130/0x220
[ 243.686788] [c000000016673930] [c000000010f607b8] __schedule+0x1f8/0x580
[ 243.686801] [c0000000166739e0] [c000000010f60bb4] schedule+0x74/0x140
[ 243.686814] [c000000016673a50] [c000000010f699b8] schedule_timeout+0x168/0x1c0
[ 243.686827] [c000000016673b20] [c000000010f61de8] __wait_for_common+0x148/0x360
[ 243.686840] [c000000016673bc0] [c000000010210840] __synchronize_srcu.part.0+0xa0/0xe0
[ 243.686855] [c000000016673c30] [c0000000105f2c64] fsnotify_mark_destroy_workfn+0xc4/0x1a0
[ 243.686868] [c000000016673ca0] [c000000010174ea8] process_one_work+0x2a8/0x570
[ 243.686882] [c000000016673d40] [c000000010175208] worker_thread+0x98/0x5e0
[ 243.686895] [c000000016673dc0] [c0000000101828d4] kthread+0x124/0x130
[ 243.686908] [c000000016673e10] [c00000001000cd40] ret_from_kernel_thread+0x5c/0x64
[ 366.566274] INFO: task systemd:1 blocked for more than 245 seconds.
[ 366.566298] Not tainted 6.1.0-rc1 #1
[ 366.566305] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 366.566314] task:systemd state:D stack:0 pid:1 ppid:0 flags:0x00042000
[ 366.566329] Call Trace:
...

The above splat occurs because PowerPC really does use maxcpus=1 instead of nr_cpus=1 in the kernel command line. Consequently, the (quite possibly non-zero) kdump CPU is the only online CPU in the kdump kernel. SRCU unconditionally queues a sdp->work on cpu 0, for which no worker thread has been created, so sdp->work will never be executed and __synchronize_srcu() will never complete.

This commit therefore replaces CPU ID 0 with get_boot_cpu_id() in key places in Tree SRCU. Since the CPU indicated by get_boot_cpu_id() is guaranteed to be online, this avoids the above splat.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: rcu@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
e29a4915 |
|
13-Oct-2022 |
Frederic Weisbecker <frederic@kernel.org> |
srcu: Debug NMI safety even on archs that don't require it Currently the NMI safety debugging is only performed on architectures that don't support NMI-safe this_cpu_inc(). Reorder the code so that other architectures like x86 also detect bad uses. [ paulmck: Apply kernel test robot, Stephen Rothwell, and Zqiang feedback. ] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
ae3c0706 |
|
13-Oct-2022 |
Frederic Weisbecker <frederic@kernel.org> |
srcu: Explain the reason behind the read side critical section on GP start Explain the need to protect against concurrent updaters who may overflow the GP counter behind the current update. Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
6b77bb9b |
|
13-Oct-2022 |
Frederic Weisbecker <frederic@kernel.org> |
srcu: Warn when NMI-unsafe API is used in NMI Using the NMI-unsafe reader API from within an NMI handler is very likely to be buggy for three reasons: 1) NMIs aren't strictly re-entrant (a pending nested NMI will execute at the end of the current one) so it should be fine to use a non-atomic increment here. However, breakpoints can still interrupt NMIs and if a breakpoint callback has a reader on that same ssp, a racy increment can happen. 2) If the only reader site for a given srcu_struct structure is in an NMI handler, then RCU should be used instead of SRCU. 3) Because of the previous reason (2), an srcu_struct structure having an SRCU read side critical section in an NMI handler is likely to have another one from a task context. For all these reasons, warn if an NMI-unsafe reader API is used from an NMI handler. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
36f65f1d |
|
20-Sep-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Check for consistent global per-srcu_struct NMI safety This commit adds runtime checks to verify that a given srcu_struct uses consistent NMI-safe (or not) read-side primitives globally, but based on the per-CPU data. These global checks are made by the grace-period code that must scan the srcu_data structures anyway, and are done only in kernels built with CONFIG_PROVE_RCU=y. Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Ogness <john.ogness@linutronix.de> Cc: Petr Mladek <pmladek@suse.com>
|
#
27120e7d |
|
19-Sep-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Check for consistent per-CPU per-srcu_struct NMI safety This commit adds runtime checks to verify that a given srcu_struct uses consistent NMI-safe (or not) read-side primitives on a per-CPU basis. Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Ogness <john.ogness@linutronix.de> Cc: Petr Mladek <pmladek@suse.com>
|
#
2e83b879 |
|
15-Sep-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Create an srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe() On strict load-store architectures, the use of this_cpu_inc() by srcu_read_lock() and srcu_read_unlock() is not NMI-safe in TREE SRCU. To see this, suppose that an NMI arrives in the middle of srcu_read_lock(), just after it has read ->srcu_lock_count, but before it has written the incremented value back to memory. If that NMI handler also does srcu_read_lock() and srcu_read_unlock() on that same srcu_struct structure, then upon return from that NMI handler, the interrupted srcu_read_lock() will overwrite the NMI handler's update to ->srcu_lock_count, but leave unchanged the NMI handler's update by srcu_read_unlock() to ->srcu_unlock_count. This can result in a too-short SRCU grace period, which can in turn result in arbitrary memory corruption. If the NMI handler instead interrupts the srcu_read_unlock(), this can result in eternal SRCU grace periods, which is not much better. This commit therefore creates a pair of new srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe() functions, which allow SRCU readers in both NMI handlers and in process and IRQ context. It is bad practice to mix the existing and the new _nmisafe() primitives on the same srcu_struct structure. Use one set or the other, not both. Just to underline that "bad practice" point, using srcu_read_lock() at process level and srcu_read_lock_nmisafe() in your NMI handler will not, repeat NOT, work. If you do not immediately understand why this is the case, please review the earlier paragraphs in this commit log. [ paulmck: Apply kernel test robot feedback. ] [ paulmck: Apply feedback from Randy Dunlap. ] [ paulmck: Apply feedback from John Ogness. ] [ paulmck: Apply feedback from Frederic Weisbecker. ] Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Ogness <john.ogness@linutronix.de> Cc: Petr Mladek <pmladek@suse.com>
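
The lost-update scenario described in this commit log can be replayed deterministically in userspace. The sketch below is a single-threaded model, not kernel code: the "NMI" is simply a function call placed between the load and the store of a non-atomic increment, and the struct and function names are hypothetical.

```c
#include <assert.h>

struct counts {
	unsigned long lock;    /* models ->srcu_lock_count */
	unsigned long unlock;  /* models ->srcu_unlock_count */
};

/* A complete reader running in the simulated NMI handler. */
static void nmi_reader(struct counts *c)
{
	c->lock++;     /* srcu_read_lock() in the NMI handler */
	c->unlock++;   /* srcu_read_unlock() in the NMI handler */
}

/*
 * A non-atomic increment split into its load and store halves,
 * with the simulated NMI arriving in between.
 */
static void interrupted_lock(struct counts *c)
{
	unsigned long tmp = c->lock;   /* load the old value */

	nmi_reader(c);                 /* NMI fires here */
	c->lock = tmp + 1;             /* store overwrites the NMI's increment */
}
```

After interrupted_lock() returns, the interrupted reader is still inside its critical section, yet both counters equal 1. A grace-period scan that compares the two sums would see no outstanding readers, which is the too-short grace period the commit log warns about.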
|
#
5d0f5953 |
|
15-Sep-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Convert ->srcu_lock_count and ->srcu_unlock_count to atomic NMI-safe variants of srcu_read_lock() and srcu_read_unlock() are needed by printk(), which on many architectures entails read-modify-write atomic operations. This commit prepares Tree SRCU for this change by making both ->srcu_lock_count and ->srcu_unlock_count be atomic_long_t. [ paulmck: Apply feedback from John Ogness. ] Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Ogness <john.ogness@linutronix.de> Cc: Petr Mladek <pmladek@suse.com>
|
#
4f2bfd94 |
|
30-Jun-2022 |
Neeraj Upadhyay <quic_neeraju@quicinc.com> |
srcu: Make expedited RCU grace periods block even less frequently

The purpose of commit 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") was to prevent a long series of never-blocking expedited SRCU grace periods from blocking kernel-live-patching (KLP) progress. Although it was successful, it also resulted in excessive boot times on certain embedded workloads running under qemu with the "-bios QEMU_EFI.fd" command line. Here "excessive" means increasing the boot time up into the three-to-four minute range. This increase in boot time was due to the more than 6000 back-to-back invocations of synchronize_rcu_expedited() within the KVM host OS, which in turn resulted from qemu's emulation of a long series of MMIO accesses.

Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods") did not significantly help this particular use case.

Zhangfei Gao and Shameerali Kolothum Thodi did experiments varying the value of SRCU_MAX_NODELAY_PHASE with HZ=250 and with various values of non-sleeping per phase counts on a system with preemption enabled, and observed the following boot times:

+──────────────────────────+────────────────+
| SRCU_MAX_NODELAY_PHASE   | Boot time (s)  |
+──────────────────────────+────────────────+
| 100                      | 30.053         |
| 150                      | 25.151         |
| 200                      | 20.704         |
| 250                      | 15.748         |
| 500                      | 11.401         |
| 1000                     | 11.443         |
| 10000                    | 11.258         |
| 1000000                  | 11.154         |
+──────────────────────────+────────────────+

Analysis of the experiment results shows additional improvements with CPU-bound delays approaching one jiffy in duration. This improvement was also seen when the number of per-phase iterations was scaled to one jiffy.

This commit therefore scales per-grace-period phase number of non-sleeping polls so that non-sleeping polls extend for about one jiffy. In addition, the delay-calculation call to srcu_get_delay() in srcu_gp_end() is replaced with a simple check for an expedited grace period. This change schedules callback invocation immediately after expedited grace periods complete, which results in greatly improved boot times. Testing done by Marc and Zhangfei confirms that this change recovers most of the performance degradation in boot time; for the CONFIG_HZ_250 configuration, specifically, boot times improve from 3m50s to 41s on Marc's setup; and from 2m40s to ~9.7s on Zhangfei's setup.

In addition to the changes to default per phase delays, this change adds 3 new kernel parameters - srcutree.srcu_max_nodelay, srcutree.srcu_max_nodelay_phase, and srcutree.srcu_retry_check_delay. This allows users to configure the srcu grace period scanning delays in order to more quickly react to additional use cases.

Fixes: 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods")
Fixes: 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU")
Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reported-by: yueluck <yueluck@163.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Tested-by: Marc Zyngier <maz@kernel.org>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
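
The idea of scaling the number of non-sleeping polls so that they span roughly one jiffy can be illustrated with simple arithmetic. The helper below is a hypothetical sketch: its name, its parameters, and the absence of any upper or lower clamping are assumptions made for illustration, and it is not the formula used in the kernel.

```c
#include <assert.h>

#define USEC_PER_SEC 1000000UL

/*
 * Number of back-to-back polls that fit in about one jiffy, given the
 * tick rate and the delay between polls in microseconds.
 * Both arguments must be nonzero.
 */
static unsigned long polls_per_jiffy(unsigned long hz,
				     unsigned long retry_delay_us)
{
	unsigned long jiffy_us = USEC_PER_SEC / hz;
	unsigned long n = jiffy_us / retry_delay_us;

	return n ? n : 1;   /* always poll at least once */
}
```

For example, with HZ=250 a jiffy lasts 4000 microseconds, so an assumed 5-microsecond retry delay gives polls_per_jiffy(250, 5) == 800.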
|
#
8f870e6e |
|
12-Jun-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Block less aggressively for expedited grace periods Commit 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") fixed a problem where a long-running expedited SRCU grace period could block kernel live patching. It did so by giving up on expediting once a given SRCU expedited grace period grew too old. Unfortunately, this added excessive delays to boots of virtual embedded systems specifying "-bios QEMU_EFI.fd" to qemu. This commit therefore makes the transition away from expediting less aggressive, increasing the per-grace-period phase number of non-sleeping polls of readers from one to three and increasing the required grace-period age from one jiffy (actually from zero to one jiffies) to two jiffies (actually from one to two jiffies). Fixes: 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org> Reported-by: "chenxiang (M)" <chenxiang66@hisilicon.com> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/
|
#
586e31d5 |
|
15-Mar-2022 |
Lukas Bulwahn <lukas.bulwahn@gmail.com> |
srcu: Drop needless initialization of sdp in srcu_gp_start() Commit 994f706872e6 ("srcu: Make Tree SRCU able to operate without snp_node array") initializes the local variable sdp differently depending on the srcu's state in srcu_gp_start(). Either way, this initialization overwrites the value used when sdp is defined. This commit therefore drops this pointless definition-time initialization. Although there is no functional change, compiler code generation may be affected. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
282d8998 |
|
08-Mar-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Prevent expedited GPs and blocking readers from consuming CPU If an SRCU reader blocks while a synchronize_srcu_expedited() waits for that same reader, then that grace period will spawn an endless series of workqueue handlers, consuming a full CPU. This quickly gets pointless because consuming more CPU isn't going to make that reader get done faster, especially if it is blocked waiting for an external event. This commit therefore spawns at most one pair of back-to-back workqueue handlers per expedited grace period phase, instead inserting increasing delays as that grace period phase grows older, but capped at 10 jiffies. In any case, if there have been at least 100 back-to-back workqueue handlers within a single jiffy, regardless of grace period or grace-period phase, then a one-jiffy delay is inserted. [ paulmck: Apply feedback from kernel test robot. ] Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Reported-by: Song Liu <song@kernel.org> Tested-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
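
The throttling policy described above can be sketched as a small pure function. Only the two limits (a 10-jiffy cap and 100 back-to-back handlers per jiffy) come from the commit message. The function name, its parameters, and the choice of a delay that grows linearly with the age of the phase are assumptions made for illustration.

```c
#include <assert.h>

#define MAX_DELAY_JIFFIES     10UL    /* cap stated in the commit message */
#define MAX_NODELAY_PER_JIFFY 100UL   /* handler limit stated in the commit message */

/*
 * phase_age:          jiffies since the current grace-period phase began.
 * nodelay_this_jiffy: zero-delay handlers already run in this jiffy.
 * Returns the delay, in jiffies, before the next workqueue handler.
 */
static unsigned long requeue_delay(unsigned long phase_age,
				   unsigned long nodelay_this_jiffy)
{
	unsigned long delay = phase_age;   /* assumed: older phase, longer wait */

	if (delay == 0 && nodelay_this_jiffy >= MAX_NODELAY_PER_JIFFY)
		delay = 1;                 /* throttle a storm of handlers */
	return delay > MAX_DELAY_JIFFIES ? MAX_DELAY_JIFFIES : delay;
}
```

Under these assumptions a fresh phase is retried immediately, a phase that is three jiffies old waits three jiffies, and a reader blocked for a long time costs at most one handler every ten jiffies rather than a full CPU.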
|
#
c2445d38 |
|
31-Jan-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Add contention check to call_srcu() srcu_data ->lock acquisition This commit increases the sensitivity of contention detection by adding checks to the acquisition of the srcu_data structure's lock on the call_srcu() code path. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
a57ffb3c |
|
31-Jan-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Automatically determine size-transition strategy at boot This commit adds a srcutree.convert_to_big option of zero that causes SRCU to decide at boot whether to wait for contention (small systems) or immediately expand to large (large systems). A new srcutree.big_cpu_lim (defaulting to 128) defines how many CPUs constitute a large system. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
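
The boot-time decision can be pictured as a single comparison against the CPU-count threshold. This is a hypothetical sketch: the enum, the function name, and the choice of ">=" at the boundary are assumptions, with only the default limit of 128 taken from the commit message.

```c
#include <assert.h>

enum size_strategy {
	SIZE_ON_CONTENTION,   /* small system: wait for contention */
	SIZE_BIG_AT_INIT      /* large system: expand immediately */
};

/* Pick a size-transition strategy from the number of CPUs. */
static enum size_strategy pick_strategy(unsigned int nr_cpus,
					unsigned int big_cpu_lim)
{
	return nr_cpus >= big_cpu_lim ? SIZE_BIG_AT_INIT : SIZE_ON_CONTENTION;
}
```

With the default limit, an 8-CPU machine would wait for contention while a 256-CPU machine would expand at initialization time.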
|
#
9f2e91d9 |
|
27-Jan-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Add contention-triggered addition of srcu_node tree

This commit instruments the acquisitions of the srcu_struct structure's ->lock, enabling the initiation of a transition from SRCU_SIZE_SMALL to SRCU_SIZE_BIG when sufficient contention is experienced. The instrumentation counts the number of trylock failures within the confines of a single jiffy. If that number exceeds the value specified by the srcutree.small_contention_lim kernel boot parameter (which defaults to 100), and if the value specified by the srcutree.convert_to_big kernel boot parameter has the 0x10 bit set (defaults to 0), then a transition will be automatically initiated.

By default, there will never be any transitions, so that none of the srcu_struct structures ever gains an srcu_node array.

The useful values for srcutree.convert_to_big are:

0x00: Never convert.
0x01: Always convert at init_srcu_struct() time.
0x02: Convert when rcutorture prints its first round of statistics.
0x03: Decide conversion approach at boot given system size.
0x10: Convert if contention is encountered.
0x12: Convert if contention is encountered or when rcutorture prints its first round of statistics, whichever comes first.

The value 0x11 acts the same as 0x01 because the conversion happens before there is any chance of contention.

[ paulmck: Apply "static" feedback from kernel test robot. ]

Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
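
The per-jiffy counting of trylock failures described in this commit log can be sketched as follows. This is a userspace model with hypothetical names: the struct, the function, and the SIZING_CONTEND macro are illustrative, and only the 0x10 bit and the "exceeds the limit within one jiffy" rule come from the text above.

```c
#include <assert.h>
#include <stdbool.h>

#define SIZING_CONTEND 0x10u   /* contention bit named in the commit message */

struct contention {
	unsigned long jiffy;   /* jiffy in which failures are being counted */
	unsigned long fails;   /* trylock failures seen during that jiffy */
};

/* Record one trylock failure; return true when a transition should start. */
static bool note_trylock_failure(struct contention *c, unsigned long now,
				 unsigned long limit,
				 unsigned int convert_to_big)
{
	if (!(convert_to_big & SIZING_CONTEND))
		return false;          /* contention-based conversion disabled */
	if (c->jiffy != now) {         /* new jiffy: restart the count */
		c->jiffy = now;
		c->fails = 0;
	}
	return ++c->fails > limit;     /* fire only once the limit is exceeded */
}
```

With a limit of 100, the first 100 failures in a jiffy return false and the 101st returns true. A failure in the next jiffy starts the count again from one.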
|
#
99659f64 |
|
27-Jan-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Create concurrency-safe helper for initiating size transition Once there are contention-initiated size transitions, it will be possible for rcutorture to initiate a transition at the same time as a contention-initiated transition. This commit therefore creates a concurrency-safe helper function named srcu_transition_to_big() to safely initiate size transitions. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
ee5e2448 |
|
27-Jan-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Explain srcu_funnel_gp_start() call to list_add() is safe This commit adds a comment explaining why an unprotected call to list_add() from srcu_funnel_gp_start() can be safe. TL;DR: It is only called during very early boot when we don't have no steeking concurrency! Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
46470cf8 |
|
27-Jan-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Prevent cleanup_srcu_struct() from freeing non-dynamic ->sda When an srcu_struct structure is created (but not in a kernel module) by DEFINE_SRCU() and friends, the per-CPU srcu_data structure is statically allocated. In all other cases, that structure is obtained from alloc_percpu(), in which case cleanup_srcu_struct() must invoke free_percpu() on the resulting ->sda pointer in the srcu_struct pointer. Which it does. Except that it also invokes free_percpu() on the ->sda pointer referencing the statically allocated per-CPU srcu_data structures. Which free_percpu() is surprisingly OK with. This commit nevertheless stops cleanup_srcu_struct() from freeing statically allocated per-CPU srcu_data structures. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
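
One way to keep cleanup from freeing statically allocated storage is to record at initialization time how the storage was obtained. The sketch below models that idea in userspace with malloc() and free() standing in for alloc_percpu() and free_percpu(). The struct, the flag name, and the helper functions are hypothetical and are not claimed to match the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_srcu {
	void *sda;            /* stands in for the per-CPU srcu_data pointer */
	bool sda_is_static;   /* hypothetical flag: storage was not heap-allocated */
};

static char static_sda[64];   /* models statically allocated per-CPU data */

static void toy_init_static(struct toy_srcu *s)
{
	s->sda = static_sda;
	s->sda_is_static = true;
}

static int toy_init_dynamic(struct toy_srcu *s)
{
	s->sda = malloc(64);
	s->sda_is_static = false;
	return s->sda ? 0 : -1;
}

/* Free only what was dynamically allocated, then clear the pointer. */
static void toy_cleanup(struct toy_srcu *s)
{
	if (!s->sda_is_static)
		free(s->sda);
	s->sda = NULL;
}
```

Clearing the pointer in both cases also gives later code a simple NULL check to detect use after cleanup.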
|
#
4a230f80 |
|
27-Jan-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Avoid NULL dereference in srcu_torture_stats_print() You really shouldn't invoke srcu_torture_stats_print() after invoking cleanup_srcu_struct(), but there is really no reason to get a compiler-obfuscated per-CPU-variable NULL pointer dereference as the diagnostic. This commit therefore checks for NULL ->sda and makes a more polite console-message complaint in that case. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
c69a00a1 |
|
25-Jan-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Add boot-time control over srcu_node array allocation This commit adds an srcutree.convert_to_big kernel parameter that either refuses to convert at all (0), converts immediately at init_srcu_struct() time (1), or lets rcutorture convert it (2). An additional contention-based dynamic conversion choice will be added, along with documentation. [ paulmck: Apply callback-scanning feedback from Neeraj Upadhyay. ] Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
0b56f953 |
|
21-Feb-2022 |
Neeraj Upadhyay <quic_neeraju@quicinc.com> |
srcu: Ensure snp nodes tree is fully initialized before traversal

For configurations where snp node tree is not initialized at init time (added in subsequent commits), srcu_funnel_gp_start() and srcu_funnel_exp_start() can potentially traverse and observe the snp nodes' transient (uninitialized) states. This can potentially happen when init_srcu_struct_nodes() initialization of sdp->mynode races with srcu_funnel_gp_start() and srcu_funnel_exp_start().

Consider the case below where srcu_funnel_gp_start() observes sdp->mynode to be not NULL and uses an uninitialized sdp->grpmask:

          P1                                  P2

init_srcu_struct_nodes()             void srcu_funnel_gp_start(...)
{
for_each_possible_cpu(cpu) {
  ...
  sdp->mynode = &snp_first[...];
  for (snp = sdp->mynode;...)          struct srcu_node *snp_leaf =
                                         smp_load_acquire(&sdp->mynode)
    ...                                if (snp_leaf) {
                                         for (snp = snp_leaf; ...)
                                           ...
                                           if (snp == snp_leaf)
                                             snp->srcu_data_have_cbs[idx] |=
                                               sdp->grpmask;
  sdp->grpmask =
    1 << (cpu - sdp->mynode->grplo);
  }
}

Similarly, init_srcu_struct_nodes() and srcu_funnel_exp_start() can race, where srcu_funnel_exp_start() could observe state of snp lock before spin_lock_init():

          P1                                            P2

init_srcu_struct_nodes()                       void srcu_funnel_exp_start(...)
{
srcu_for_each_node_breadth_first(ssp, snp) {     for (; ...) {
                                                   spin_lock_...(snp, )
  spin_lock_init(&ACCESS_PRIVATE(snp, lock));
  ...
}
for_each_possible_cpu(cpu) {
  ...
  sdp->mynode = &snp_first[...];

To avoid these issues, ensure that snp node tree initialization is complete, that is, that the SRCU_SIZE_WAIT_BARRIER srcu_size_state has been reached, before traversing the tree. Given that srcu_funnel_gp_start() and srcu_funnel_exp_start() are called within SRCU read side critical sections, this check is safe, in the sense that all callbacks are enqueued on CPU0 srcu_cblist until SRCU_SIZE_WAIT_CALL is entered, and these read side critical sections (containing srcu_funnel_gp_start() and srcu_funnel_exp_start()) need to complete before SRCU_SIZE_WAIT_CALL is reached.

Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
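
The fix follows the usual publish-after-initialize pattern: the initializer finishes every node and only then advances the size state with release semantics, and a traverser reads the state with acquire semantics before touching the tree. The following userspace sketch uses C11 atomics in place of smp_store_release() and smp_load_acquire(). The struct layout, the shortened state names, and the helper functions are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Subset of the size states, names shortened for illustration. */
enum { SIZE_SMALL, SIZE_ALLOC, SIZE_WAIT_BARRIER };

struct toy_node {
	int lock_ready;   /* models "spin_lock_init() has run" */
};

struct toy_srcu {
	_Atomic int size_state;
	struct toy_node node;
};

/* Initializer: finish the node first, then publish the new state. */
static void toy_init_nodes(struct toy_srcu *s)
{
	s->node.lock_ready = 1;
	atomic_store_explicit(&s->size_state, SIZE_WAIT_BARRIER,
			      memory_order_release);
}

/* Traverser: use the tree only once the published state says it is complete. */
static struct toy_node *toy_tree_or_null(struct toy_srcu *s)
{
	if (atomic_load_explicit(&s->size_state, memory_order_acquire) <
	    SIZE_WAIT_BARRIER)
		return NULL;   /* caller falls back to the non-tree path */
	return &s->node;
}
```

Because the acquire load pairs with the release store, a traverser that observes SIZE_WAIT_BARRIER is also guaranteed to observe the completed node initialization.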
|
#
cbdc98e9 |
|
26-Jan-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Use invalid initial value for srcu_node GP sequence numbers Currently, tree SRCU relies on the srcu_node structures being initialized at the same time that the srcu_struct itself is initialized, and thus uses the initial grace-period sequence number as the initial value for the srcu_node structure's ->srcu_have_cbs[] and ->srcu_gp_seq_needed_exp fields. Although this has a high probability of also working when the srcu_node array is allocated and initialized at some random later time, it would be better to avoid leaving such things to chance. This commit therefore initializes these fields with 0x2, which is a recognizable invalid value. It then adds the required checks for this invalid value in order to avoid confusion on long-running kernels (especially those on 32-bit systems) that allocate and initialize srcu_node arrays late in life. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
aeb9b39b |
|
26-Jan-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Compute snp_seq earlier in srcu_funnel_gp_start() Currently, srcu_funnel_gp_start() tests snp->srcu_have_cbs[idx] and then separately assigns it to the snp_seq local variable. This commit does the assignment earlier to simplify the code a bit. While in the area, this commit also takes advantage of the 100-character line limit to put the call to srcu_schedule_cbs_sdp() on a single line. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
3bedebcf |
|
24-Jan-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Make rcutorture dump the SRCU size state This commit adds the numeric and string version of ->srcu_size_state to the Tree-SRCU-specific portion of the rcutorture output. [ paulmck: Apply feedback from kernel test robot and Dan Carpenter. ] [ quic_neeraju: Apply feedback from Jiapeng Chong. ] Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
e2f63836 |
|
24-Jan-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Add size-state transitioning code This is just dead code at the moment, and will be used once the state-transition code is activated. Because srcu_barrier() must be aware of transition before call_srcu(), the state machine waits for an SRCU grace period before callbacks are queued to the non-CPU-0 queues. This requires that portions of srcu_barrier() be enclosed in an SRCU read-side critical section. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
2ec30311 |
|
21-Jan-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Dynamically allocate srcu_node array This commit shrinks the srcu_struct structure by converting its ->node field from a fixed-size compile-time array to a pointer to a dynamically allocated array. In kernels built with large values of NR_CPUS that boot on systems with smaller numbers of CPUs, this can save significant memory. [ paulmck: Apply kernel test robot feedback. ] Reported-by: A cast of thousands Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
994f7068 |
|
24-Jan-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Make Tree SRCU able to operate without snp_node array This commit makes Tree SRCU able to operate without an snp_node array, that is, when the srcu_data structures' ->mynode pointers are NULL. This can result in high contention on the srcu_struct structure's ->lock, but only when there are lots of call_srcu(), synchronize_srcu(), and synchronize_srcu_expedited() calls. Note that when there is no snp_node array, all SRCU callbacks use CPU 0's callback queue. This is optimal in the common case of low update-side load because it removes the need to search each CPU for the single callback that made the grace period happen. Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
7b9e9b58 |
|
20-Jan-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Make srcu_funnel_gp_start() cache ->mynode in snp_leaf Currently, the srcu_funnel_gp_start() walks its local variable snp up the tree and reloads sdp->mynode whenever it is necessary to check whether it is still at the leaf srcu_node level. This works, but is a bit more obtuse than absolutely necessary. In addition, upcoming commits will dynamically size srcu_struct structures, in which case sdp->mynode will no longer necessarily be a constant, and this commit helps prepare for that dynamic sizing. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
8ed00760 |
|
12-Jan-2022 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Tighten cleanup_srcu_struct() GP checks Currently, cleanup_srcu_struct() checks for a grace period in progress, but it does not check for a grace period that has not yet started but which might start at any time. Such a situation could result in a use-after-free bug, so this commit adds a check for a grace period that is needed but not yet started to cleanup_srcu_struct(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
a616aec9 |
|
22-Mar-2021 |
Ingo Molnar <mingo@kernel.org> |
rcu: Fix various typos in comments Fix ~12 single-word typos in RCU code comments. [ paulmck: Apply feedback from Randy Dunlap. ] Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
3d3a0d1b |
|
16-Apr-2021 |
Paul E. McKenney <paulmck@kernel.org> |
rcu: Point to documentation of ordering guarantees Add comments to synchronize_rcu() and friends that point to Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
b5befe84 |
|
17-Apr-2021 |
Frederic Weisbecker <frederic@kernel.org> |
srcu: Fix broken node geometry after early ssp init An srcu_struct structure that is initialized before rcu_init_geometry() will have its srcu_node hierarchy based on CONFIG_NR_CPUS. Once rcu_init_geometry() is called, this hierarchy is compressed as needed for the actual maximum number of CPUs for this system. Later on, that srcu_struct structure is confused, sometimes referring to its initial CONFIG_NR_CPUS-based hierarchy, and sometimes instead to the new num_possible_cpus() hierarchy. For example, each of its ->mynode fields continues to reference the original leaf rcu_node structures, some of which might no longer exist. On the other hand, srcu_for_each_node_breadth_first() traverses to the new node hierarchy. There are at least two bad possible outcomes to this:
1) a) A callback enqueued early on an srcu_data structure (call it *sdp) is recorded pending on sdp->mynode->srcu_data_have_cbs in srcu_funnel_gp_start() with sdp->mynode pointing to a deep leaf (say 3 levels).
   b) The grace period ends after rcu_init_geometry() shrinks the nodes level to a single one. srcu_gp_end() walks through the new srcu_node hierarchy without ever reaching the old leaves so the callback is never executed.
   This is easily reproduced on an 8 CPUs machine with CONFIG_NR_CPUS >= 32 and "rcupdate.rcu_self_test=1". The srcu_barrier() after early tests verification never completes and the boot hangs:
[ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 seconds.
[ 5413.147564] Not tainted 5.12.0-rc4+ #28
[ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5413.159753] task:swapper/0 state:D stack: 0 pid: 1 ppid: 0 flags:0x00004000
[ 5413.168099] Call Trace:
[ 5413.170555] __schedule+0x36c/0x930
[ 5413.174057] ? wait_for_completion+0x88/0x110
[ 5413.178423] schedule+0x46/0xf0
[ 5413.181575] schedule_timeout+0x284/0x380
[ 5413.185591] ? wait_for_completion+0x88/0x110
[ 5413.189957] ? mark_held_locks+0x61/0x80
[ 5413.193882] ? mark_held_locks+0x61/0x80
[ 5413.197809] ? _raw_spin_unlock_irq+0x24/0x50
[ 5413.202173] ? wait_for_completion+0x88/0x110
[ 5413.206535] wait_for_completion+0xb4/0x110
[ 5413.210724] ? srcu_torture_stats_print+0x110/0x110
[ 5413.215610] srcu_barrier+0x187/0x200
[ 5413.219277] ? rcu_tasks_verify_self_tests+0x50/0x50
[ 5413.224244] ? rdinit_setup+0x2b/0x2b
[ 5413.227907] rcu_verify_early_boot_tests+0x2d/0x40
[ 5413.232700] do_one_initcall+0x63/0x310
[ 5413.236541] ? rdinit_setup+0x2b/0x2b
[ 5413.240207] ? rcu_read_lock_sched_held+0x52/0x80
[ 5413.244912] kernel_init_freeable+0x253/0x28f
[ 5413.249273] ? rest_init+0x250/0x250
[ 5413.252846] kernel_init+0xa/0x110
[ 5413.256257] ret_from_fork+0x22/0x30
2) An srcu_struct structure that is initialized before rcu_init_geometry() and used afterward will always have stale rdp->mynode references, resulting in callbacks to be missed in srcu_gp_end(), just like in the previous scenario.
This commit therefore causes init_srcu_struct_nodes to initialize the geometry, if needed. This ensures that the srcu_node hierarchy is properly built and distributed from the get-go. Suggested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
8e9c01c7 |
|
08-Apr-2021 |
Frederic Weisbecker <frederic@kernel.org> |
srcu: Initialize SRCU after timers Once srcu_init() is called, the SRCU core will make use of delayed workqueues, which rely on timers. However init_timers() is called several steps after rcu_init(). This means that a call_srcu() after rcu_init() but before init_timers() would find itself within a dangerously uninitialized timer core. This commit therefore creates a separate call to srcu_init() after init_timers() completes, which ensures that we stay in early SRCU mode until timers are safe(r). Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
c75e9d29 |
|
01-Apr-2021 |
Frederic Weisbecker <frederic@kernel.org> |
srcu: Remove superfluous ssp initialization for early callbacks Pre-srcu_init() invocations of call_srcu() initialize the srcu_struct structure in question, so there is no need to check this initialization in srcu_init() when initiating grace periods for srcu_struct structures that had early call_srcu() invocations. This commit therefore drops the calls to check_init_srcu_struct() in srcu_init(). Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
94df76a1 |
|
01-Apr-2021 |
Frederic Weisbecker <frederic@kernel.org> |
srcu: Remove superfluous sdp->srcu_lock_count zero filling Because alloc_percpu() zeroes out the allocated memory, there is no need to zero-fill newly allocated per-CPU memory. This commit therefore removes the loop zeroing the ->srcu_lock_count and ->srcu_unlock_count arrays from init_srcu_struct_nodes(). This is the only use of that function's is_static parameter, which this commit also removes. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
ae5c2341 |
|
23-Sep-2020 |
Joel Fernandes (Google) <joel@joelfernandes.org> |
rcu/segcblist: Add counters to segcblist datastructure Add counting of segment lengths of segmented callback list. This will be useful for a number of things such as knowing how big the ready-to-execute segment has gotten. The immediate benefit is ability to trace how the callbacks in the segmented callback list change. Also this patch removes hacks related to using donecbs's ->len field as a temporary variable to save the segmented callback list's length. This cannot be done anymore and is not needed. Also fix SRCU: The negative counting of the unsegmented list cannot be used to adjust the segmented one. To fix this, sample the unsegmented length in advance, and use it after CB execution to adjust the segmented list's length. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
4e7ccfae |
|
15-Nov-2020 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Add comment explaining cookie overflow/wrap This commit adds to the poll_state_synchronize_srcu() header comment describing the issues surrounding SRCU cookie overflow/wrap for the different kernel configurations. Link: https://lore.kernel.org/rcu/20201112201547.GF3365678@moria.home.lan/ Reported-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
5358c9fa |
|
13-Nov-2020 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Provide polling interfaces for Tree SRCU grace periods There is a need for a polling interface for SRCU grace periods, so this commit supplies get_state_synchronize_srcu(), start_poll_synchronize_srcu(), and poll_state_synchronize_srcu() for this purpose. The first can be used if future grace periods are inevitable (perhaps due to a later call_srcu() invocation), the second if future grace periods might not otherwise happen, and the third to check if a grace period has elapsed since the corresponding call to either of the first two. As with get_state_synchronize_rcu() and cond_synchronize_rcu(), the return value from either get_state_synchronize_srcu() or start_poll_synchronize_srcu() must be passed in to a later call to poll_state_synchronize_srcu(). Link: https://lore.kernel.org/rcu/20201112201547.GF3365678@moria.home.lan/ Reported-by: Kent Overstreet <kent.overstreet@gmail.com> [ paulmck: Add EXPORT_SYMBOL_GPL() per kernel test robot feedback. ] [ paulmck: Apply feedback from Neeraj Upadhyay. ] Link: https://lore.kernel.org/lkml/20201117004017.GA7444@paulmck-ThinkPad-P72/ Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
29d2bb94 |
|
13-Nov-2020 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Provide internal interface to start a Tree SRCU grace period There is a need for a polling interface for SRCU grace periods. This polling needs to initiate an SRCU grace period without having to queue (and manage) a callback. This commit therefore splits the Tree SRCU __call_srcu() function into callback-initialization and queuing/start-grace-period portions, with the latter in a new function named srcu_gp_start_if_needed(). This function may be passed a NULL callback pointer, in which case it will refrain from queuing anything. Why have the new function mess with queuing? Locking considerations, of course! Link: https://lore.kernel.org/rcu/20201112201547.GF3365678@moria.home.lan/ Reported-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
50edb988 |
|
10-Sep-2020 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Take early exit on memory-allocation failure It turns out that init_srcu_struct() can be invoked from usermode tasks, and that fatal signals received by these tasks can cause memory-allocation failures. These failures are not handled well by init_srcu_struct(), so much so that NULL pointer dereferences can result. This commit therefore causes init_srcu_struct() to take an early exit upon detection of memory-allocation failure. Link: https://lore.kernel.org/lkml/20200908144306.33355-1-aik@ozlabs.ru/ Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
f505d434 |
|
16-Sep-2020 |
Jakub Kicinski <kuba@kernel.org> |
srcu: Use a more appropriate lockdep helper The lockdep_is_held() macro is defined as: #define lockdep_is_held(lock) lock_is_held(&(lock)->dep_map) This hides away the dereference, so that builds with !LOCKDEP don't break. This works in current kernels because the RCU_LOCKDEP_WARN() eliminates its condition at preprocessor time in !LOCKDEP kernels. However, later patches in this series will cause the compiler to see this condition even in !LOCKDEP kernels. This commit prepares for this upcoming change by switching from lock_is_held() to lockdep_is_held(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> -- CC: jiangshanlai@gmail.com CC: paulmck@kernel.org CC: josh@joshtriplett.org CC: rostedt@goodmis.org CC: mathieu.desnoyers@efficios.com Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
d9b60741 |
|
17-Jun-2020 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Remove KCSAN stubs KCSAN is now in mainline, so this commit removes the stubs for the data_race(), ASSERT_EXCLUSIVE_WRITER(), and ASSERT_EXCLUSIVE_ACCESS() macros. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
bde50d8f |
|
26-May-2020 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
srcu: Avoid local_irq_save() before acquiring spinlock_t SRCU disables interrupts to get a stable per-CPU pointer and then acquires the spinlock which is in the per-CPU data structure. The release uses spin_unlock_irqrestore(). While this is correct on a non-RT kernel, this conflicts with the RT semantics because the spinlock is converted to a 'sleeping' spinlock. Sleeping locks can obviously not be acquired with interrupts disabled. Acquire the per-CPU pointer `ssp->sda' without disabling preemption and then acquire the spinlock_t of the per-CPU data structure. The lock will ensure that the data is consistent. The added call to check_init_srcu_struct() is now needed because a statically defined srcu_struct may remain uninitialized until this point and the newly introduced locking operation requires an initialized spinlock_t. This change was tested for four hours with 8*SRCU-N and 8*SRCU-P without causing any warnings. Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: rcu@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
7fef6cff |
|
18-Apr-2020 |
Ethon Paul <ethp@qq.com> |
srcu: Fix a typo in comment "amoritized"->"amortized" This commit fixes a typo in a comment. Signed-off-by: Ethon Paul <ethp@qq.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
b68c6146 |
|
03-Jan-2020 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Add data_race() to ->srcu_lock_count and ->srcu_unlock_count arrays The srcu_data structure's ->srcu_lock_count and ->srcu_unlock_count arrays are read and written locklessly, so this commit adds the data_race() to the diagnostic-print loads from these arrays in order to mark them as known and approved data-racy accesses. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely and due to this being used only by rcutorture. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
4f58820f |
|
13-Apr-2020 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Add KCSAN stubs This commit adds stubs for KCSAN's data_race(), ASSERT_EXCLUSIVE_WRITER(), and ASSERT_EXCLUSIVE_ACCESS() macros to allow code using these macros to move ahead. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
71042606 |
|
03-Jan-2020 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Hold srcu_struct ->lock when updating ->srcu_gp_seq A read of the srcu_struct structure's ->srcu_gp_seq field should not need READ_ONCE() when that structure's ->lock is held. Except that this lock is not always held when updating this field. This commit therefore acquires the lock around updates and removes a now-unneeded READ_ONCE(). This data race was reported by KCSAN. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> [ paulmck: Switch from READ_ONCE() to lock per Peter Zijlstra's question. ] Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
#
39f91504 |
|
22-Dec-2019 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Fix process_srcu()/srcu_batches_completed() datarace The srcu_struct structure's ->srcu_idx field is accessed locklessly, so reads must use READ_ONCE(). This commit therefore adds the needed READ_ONCE() invocation where it was missed. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
8c9e0cb3 |
|
22-Dec-2019 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Fix __call_srcu()/srcu_get_delay() datarace The srcu_struct structure's ->srcu_gp_seq_needed_exp field is accessed locklessly, so updates must use WRITE_ONCE(). This commit therefore adds the needed WRITE_ONCE() invocations. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
7ff8b450 |
|
22-Dec-2019 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Fix __call_srcu()/process_srcu() datarace The srcu_node structure's ->srcu_gp_seq_needed_exp field is accessed locklessly, so updates must use WRITE_ONCE(). This commit therefore adds the needed WRITE_ONCE() invocations. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
65bb0dc4 |
|
06-Jan-2020 |
SeongJae Park <sjpark@amazon.de> |
rcu: Fix typos in file-header comments Convert to plural and add a note that this is for Tree RCU. Signed-off-by: SeongJae Park <sjpark@amazon.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
844a378d |
|
04-Nov-2019 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Apply *_ONCE() to ->srcu_last_gp_end The ->srcu_last_gp_end field is accessed from any CPU at any time by synchronize_srcu(), so non-initialization references need to use READ_ONCE() and WRITE_ONCE(). This commit therefore makes that change. Reported-by: syzbot+08f3e9d26e5541e1ecf2@syzkaller.appspotmail.com Acked-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
77a40f97 |
|
29-Aug-2019 |
Joel Fernandes (Google) <joel@joelfernandes.org> |
rcu: Remove kfree_rcu() special casing and lazy-callback handling This commit removes kfree_rcu() special-casing and the lazy-callback handling from Tree RCU. It moves some of this special casing to Tiny RCU, the removal of which will be the subject of later commits. This results in a nice negative delta. Suggested-by: Paul E. McKenney <paulmck@linux.ibm.com> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> [ paulmck: Add slab.h #include, thanks to kbuild test robot <lkp@intel.com>. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
#
7e210a65 |
|
28-Jun-2019 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Avoid srcutorture security-based pointer obfuscation Because pointer output is now obfuscated, and because what you really want to know is whether or not the callback lists are empty, this commit replaces the srcu_data structure's head callback pointer printout with a single character that is "." if the callback list is empty or "C" otherwise. This is the only remaining user of rcu_segcblist_head(), so this commit also removes this function's definition. It also turns out that rcu_segcblist_tail() no longer has any callers, so this commit removes that function's definition while in the area. They were both marked "Interim", and their end has come. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
|
#
11b00045 |
|
22-Apr-2019 |
Jiang Biao <benbjiang@tencent.com> |
rcu: Make __call_srcu static Because __call_srcu() is not used outside kernel/rcu/srcutree.c, this commit makes it static. Signed-off-by: Jiang Biao <benbjiang@tencent.com> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
|
#
fe15b50c |
|
05-Apr-2019 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Allocate per-CPU data for DEFINE_SRCU() in modules Adding DEFINE_SRCU() or DEFINE_STATIC_SRCU() to a loadable module requires that the size of the reserved region be increased, which is not something we want to be doing all that often. One approach would be to require that loadable modules define an srcu_struct and invoke init_srcu_struct() from their module_init function and cleanup_srcu_struct() from their module_exit function. However, this is more than a bit user unfriendly. This commit therefore creates an ___srcu_struct_ptrs linker section, and pointers to srcu_struct structures created by DEFINE_SRCU() and DEFINE_STATIC_SRCU() within a module are placed into that module's ___srcu_struct_ptrs section. The required init_srcu_struct() and cleanup_srcu_struct() functions are then automatically invoked as needed when that module is loaded and unloaded, thus allowing modules to continue to use DEFINE_SRCU() and DEFINE_STATIC_SRCU() while avoiding the need to increase the size of the reserved region. Many of the algorithms and some of the code was cheerfully cherry-picked from other code making use of linker sections, perhaps most notably from tracepoints. All bugs are nevertheless the sole property of the author. Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> [ paulmck: Use __section() and use "default" in srcu_module_notify()'s "switch" statement as suggested by Joel Fernandes. ] Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
#
f5ad3991 |
|
13-Feb-2019 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Remove cleanup_srcu_struct_quiesced() The cleanup_srcu_struct_quiesced() function was added because NVME used WQ_MEM_RECLAIM workqueues and SRCU did not, which meant that NVME workqueues waiting on SRCU workqueues could result in deadlocks during low-memory conditions. However, SRCU now also has WQ_MEM_RECLAIM workqueues, so there is no longer a potential for deadlock. Furthermore, it turns out to be extremely hard to use cleanup_srcu_struct_quiesced() correctly due to the fact that SRCU callback invocation accesses the srcu_struct structure's per-CPU data area just after callbacks are invoked. Therefore, the usual practice of using srcu_barrier() to wait for callbacks to be invoked before invoking cleanup_srcu_struct_quiesced() fails because SRCU's callback-invocation workqueue handler might be delayed, which can result in cleanup_srcu_struct_quiesced() being invoked (and thus freeing the per-CPU data) before the SRCU's callback-invocation workqueue handler is finished using that per-CPU data. Nor is this a theoretical problem: KASAN emitted use-after-free warnings because of this problem on actual runs. In short, NVME can now safely invoke cleanup_srcu_struct(), which avoids the use-after-free scenario. And cleanup_srcu_struct_quiesced() is quite difficult to use safely. This commit therefore removes cleanup_srcu_struct_quiesced(), switching its sole user back to cleanup_srcu_struct(). This effectively reverts the following pair of commits: f7194ac32ca2 ("srcu: Add cleanup_srcu_struct_quiesced()") 4317228ad9b8 ("nvme: Avoid flush dependency in delete controller flow") Reported-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Bart Van Assche <bvanassche@acm.org>
|
#
5cdfd174 |
|
12-Feb-2019 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Check for in-flight callbacks in _cleanup_srcu_struct() If someone fails to drain the corresponding SRCU callbacks (for example, by failing to invoke srcu_barrier()) before invoking either cleanup_srcu_struct() or cleanup_srcu_struct_quiesced(), the resulting diagnostic is an ambiguous use-after-free diagnostic, and even then only if you are running something like KASAN. This commit therefore improves SRCU diagnostics by adding checks for in-flight callbacks at _cleanup_srcu_struct() time. Note that these diagnostics can still be defeated, for example, by invoking call_srcu() concurrently with cleanup_srcu_struct(). Which is a really bad idea, but sometimes all too easy to do. But even then, these diagnostics have at least some probability of catching the problem. Reported-by: Sagi Grimberg <sagi@grimberg.me> Reported-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Tested-by: Bart Van Assche <bvanassche@acm.org>
|
#
e7ee1501 |
|
17-Jan-2019 |
Paul E. McKenney <paulmck@kernel.org> |
rcu/srcu: Convert to SPDX license identifier Replace the license boiler plate with a SPDX license identifier. While in the area, update an email address. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
|
#
e81baf4c |
|
10-Dec-2018 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
srcu: Remove srcu_queue_delayed_work_on() srcu_queue_delayed_work_on() disables preemption (and therefore CPU hotplug in RCU's case) and then checks based on its own accounting if a CPU is online. If the CPU is online it uses queue_delayed_work_on(), otherwise it falls back to queue_delayed_work(). The problem here is that queue_work() on -RT does not work with disabled preemption. queue_work_on() works also on an offlined CPU. queue_delayed_work_on() has the problem that it is possible to program a timer on an offlined CPU. This timer will fire once the CPU is online again. But until then, the timer remains programmed and nothing will happen. Add a local timer which will fire (as requested per delay) on the local CPU and then enqueue the work on the specific CPU. RCUtorture testing with SRCU-P for 24h showed no problems. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
|
#
aacb5d91 |
|
28-Oct-2018 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Use "ssp" instead of "sp" for srcu_struct pointer In RCU, the distinction between "rsp", "rnp", and "rdp" has served well for a great many years, but in SRCU, "sp" vs. "sdp" has proven confusing. This commit therefore renames SRCU's "sp" pointers to "ssp", so that there is "ssp" for srcu_struct pointer, "snp" for srcu_node pointer, and "sdp" for srcu_data pointer. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
|
#
eb4c2382 |
|
26-Oct-2018 |
Dennis Krein <Dennis.Krein@netapp.com> |
srcu: Lock srcu_data structure in srcu_gp_start() The srcu_gp_start() function is called with the srcu_struct structure's ->lock held, but not with the srcu_data structure's ->lock. This is problematic because this function accesses and updates the srcu_data structure's ->srcu_cblist, which is protected by that lock. Failing to hold this lock can result in corruption of the SRCU callback lists, which in turn can result in arbitrarily bad results. This commit therefore makes srcu_gp_start() acquire the srcu_data structure's ->lock across the calls to rcu_segcblist_advance() and rcu_segcblist_accelerate(), thus preventing this corruption. Reported-by: Bart Van Assche <bvanassche@acm.org> Reported-by: Christoph Hellwig <hch@infradead.org> Reported-by: Sebastian Kuzminsky <seb.kuzminsky@gmail.com> Signed-off-by: Dennis Krein <Dennis.Krein@netapp.com> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Tested-by: Dennis Krein <Dennis.Krein@netapp.com> Cc: <stable@vger.kernel.org> # 4.16.x
|
#
0607ba84 |
|
25-Apr-2018 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Prevent __call_srcu() counter wrap with read-side critical section Ever since cdf7abc4610a ("srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context"), it has been permissible to use SRCU read-side critical sections in interrupt context. This allows __call_srcu() to use SRCU read-side critical sections to prevent a new SRCU grace period from ending before the call to either srcu_funnel_gp_start() or srcu_funnel_exp_start() completes, thus preventing SRCU grace-period counter overflow during that time. Note that this does not permit removal of the counter-wrap checks in srcu_gp_end(). These checks are necessary to handle the case where a given CPU does not interact at all with SRCU for an extended time period. This commit therefore adds an SRCU read-side critical section to __call_srcu() in order to prevent grace period counter wrap during the funnel-locking process. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
4e6ea4ef |
|
14-Aug-2018 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Make early-boot call_srcu() reuse workqueue lists Allocating a list_head structure that is almost never used, and, when used, is used only during early boot (rcu_init() and earlier), is a bit wasteful. This commit therefore eliminates that list_head in favor of the one in the work_struct structure. This is safe because the work_struct structure cannot be used until after rcu_init() returns. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
#
e0fcba9a |
|
14-Aug-2018 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Make call_srcu() available during very early boot Event tracing is moving to SRCU in order to take advantage of the fact that SRCU may be safely used from idle and even offline CPUs. However, event tracing can invoke call_srcu() very early in the boot process, even before workqueue_init_early() is invoked (let alone rcu_init()). Therefore, call_srcu()'s attempts to queue work fail miserably. This commit therefore detects this situation, and refrains from attempting to queue work before rcu_init() time, but does everything else that it would have done, and in addition, adds the srcu_struct to a global list. The rcu_init() function now invokes a new srcu_init() function, which is empty if CONFIG_SRCU=n. Otherwise, srcu_init() queues work for each srcu_struct on the list. This all happens early enough in boot that there is but a single CPU with interrupts disabled, which allows synchronization to be dispensed with. Of course, the queued work won't actually be invoked until after workqueue_init() is invoked, which happens shortly after the scheduler is up and running. This means that although call_srcu() may be invoked any time after per-CPU variables have been set up, there is still a very narrow window when synchronize_srcu() won't work, and this window extends from the time that the scheduler starts until the time that workqueue_init() returns. This can be fixed in a manner similar to the fix for synchronize_rcu_expedited() and friends, but until someone actually needs to use synchronize_srcu() during this window, this fix is added churn for no benefit. Finally, note that Tree SRCU's new srcu_init() function invokes queue_work() rather than the queue_delayed_work() function that is invoked post-boot. The reason is that queue_delayed_work() will (as you would expect) post a timer, and timers have not yet been initialized. So use of queue_work() avoids the complaints about use of uninitialized spinlocks that would otherwise result. 
Besides, some delay is already provided by the aforementioned fact that the queued work won't actually be invoked until after the scheduler is up and running. Requested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
#
6eb95cc4 |
|
07-Jul-2018 |
Paul E. McKenney <paulmck@kernel.org> |
rcu: Clean up flavor-related definitions and comments in srcutree.h Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
aedf4ba9 |
|
04-Jul-2018 |
Paul E. McKenney <paulmck@kernel.org> |
rcu: Remove rsp parameter from rcu_node tree accessor macros There now is only one rcu_state structure in a given build of the Linux kernel, so there is no need to pass it as a parameter to RCU's rcu_node tree's accessor macros. This commit therefore removes the rsp parameter from those macros in kernel/rcu/rcu.h, and removes some now-unused rsp local variables while in the area. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
52e17ba1 |
|
19-Jun-2018 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Add grace-period number to rcutorture statistics printout This commit adds the SRCU grace-period number to the rcutorture statistics printout, which allows it to be compared to the rcutorture "Writer stall state" message. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
a7538352 |
|
14-May-2018 |
Joe Perches <joe@perches.com> |
rcu: Use pr_fmt to prefix "rcu: " to logging output This commit also adjusts some whitespace while in the area. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Revert string-breaking %s as requested by Andy Shevchenko. ]
|
#
aebc8264 |
|
01-May-2018 |
Paul E. McKenney <paulmck@kernel.org> |
rcutorture: Convert rcutorture_get_gp_data() to ->gp_seq SRCU has long used ->srcu_gp_seq, and now RCU uses ->gp_seq. This commit therefore moves the rcutorture_get_gp_data() function from a ->gpnum / ->completed pair to ->gp_seq. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
5ab07a8d |
|
22-May-2018 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Add address of first callback to rcutorture output This commit adds the address of the first callback to the per-CPU rcutorture output in order to allow lost wakeups to be more efficiently tracked down. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
17294ce6 |
|
25-Apr-2018 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Document that srcu_funnel_gp_start() implies srcu_funnel_exp_start() This commit updates the header comment of srcu_funnel_gp_start() to document the fact that srcu_funnel_gp_start() does the work of srcu_funnel_exp_start(), in some cases by invoking it directly. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
5ef98a63 |
|
24-Apr-2018 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Fix typos in __call_srcu() header comment This commit simply changes some copy-pasta call_rcu() instances to the correct call_srcu(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
f7194ac3 |
|
05-Apr-2018 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Add cleanup_srcu_struct_quiesced() The current cleanup_srcu_struct() flushes work, which prevents it from being invoked from some workqueue contexts, as well as from atomic (non-blocking) contexts. This patch therefore introduces a cleanup_srcu_struct_quiesced(), which can be invoked only after all activity on the specified srcu_struct has completed. This restriction allows cleanup_srcu_struct_quiesced() to be invoked from workqueue contexts as well as from atomic contexts. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Nitzan Carmi <nitzanc@mellanox.com> Tested-by: Nicholas Piggin <npiggin@gmail.com>
|
#
ad7c946b |
|
08-Jan-2018 |
Paul E. McKenney <paulmck@kernel.org> |
rcu: Create RCU-specific workqueues with rescuers RCU's expedited grace periods can participate in out-of-memory deadlocks due to all available system_wq kthreads being blocked and there not being memory available to create more. This commit prevents such deadlocks by allocating an RCU-specific workqueue_struct at early boot time, and providing it with a rescuer to ensure forward progress. This uses the shiny new init_rescuer() function provided by Tejun (but indirectly). This commit also causes SRCU to use this new RCU-specific workqueue_struct. Note that SRCU's use of workqueues never blocks them waiting for readers, so this should be safe from a forward-progress viewpoint. Note that this moves SRCU from system_power_efficient_wq to a normal workqueue. In the unlikely event that this results in measurable degradation, a separate power-efficient workqueue will be created for SRCU. Reported-by: Prateek Sood <prsood@codeaurora.org> Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Tejun Heo <tj@kernel.org>
|
#
6308f347 |
|
14-Feb-2018 |
Paul E. McKenney <paulmck@kernel.org> |
rcu: Remove SRCU throttling The code in srcu_gp_end() inserts a delay every 0x3ff grace periods in order to prevent SRCU grace-period work from consuming an entire CPU when there is a long sequence of expedited SRCU grace-period requests. However, all of SRCU's grace-period work is carried out in workqueues, which are in turn within kthreads, which are automatically throttled as needed by the scheduler. In particular, if there is plenty of idle time, there is no point in throttling. This commit therefore removes the expedited SRCU grace-period throttling. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
a72da917 |
|
14-Feb-2018 |
Byungchul Park <byungchul.park@lge.com> |
srcu: Remove dead code in srcu_gp_end() Of course, compilers will optimize out dead code. Still, remove the dead code for better readability. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
8ddbd883 |
|
31-Jan-2018 |
Ildar Ismagilov <devix84@gmail.com> |
srcu: Reduce scans of srcu_data in counter wrap check Currently, given a multi-level srcu_node tree, SRCU can scan the full set of srcu_data structures at each level when cleaning up after a grace period. This, though harmless otherwise, represents pointless overhead. This commit therefore eliminates this overhead by scanning the srcu_data structures only when traversing the leaf srcu_node structures. Signed-off-by: Ildar Ismagilov <devix84@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
a35d13ec |
|
31-Jan-2018 |
Ildar Ismagilov <devix84@gmail.com> |
srcu: Prevent sdp->srcu_gp_seq_needed_exp counter wrap SRCU checks each srcu_data structure's grace-period number for counter wrap four times per cycle by default. This frequency guarantees that normal comparisons will detect potential wrap. However, the expedited grace-period number is not checked. The consequences are not too horrible (a failure to expedite a grace period when requested), but it would be good to avoid such things. This commit therefore adds this check to the expedited grace-period number. Signed-off-by: Ildar Ismagilov <devix84@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
cb4081cd |
|
01-Dec-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Abstract function name This commit moves to __func__ for function names in the name of better resilience to change. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
274afd6b |
|
30-Jan-2018 |
Ildar Ismagilov <devix84@gmail.com> |
rcu: Fix misprint in srcu_funnel_exp_start The srcu_funnel_exp_start() function checks to see if the srcu_struct structure's expedited grace period counter needs updating to reflect a newly arrived request for an expedited SRCU grace period. Unfortunately, the check is backwards, so this commit reverses the sense of the test. Signed-off-by: Ildar Ismagilov <devix84@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
a32e01ee |
|
17-Jan-2018 |
Matthew Wilcox <willy@infradead.org> |
rcu: Use wrapper for lockdep asserts Commits c0b334c5bfa9 and ea9b0c8a26a2 introduced new sparse warnings by accessing rcu_node->lock directly and ignoring the __private marker. Introduce a new wrapper and use it. Also fix a similar problem in srcutree.c introduced by a3883df3935e. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
d6331980 |
|
10-Oct-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Prohibit call_srcu() use under raw spinlocks Invoking queue_delayed_work() while holding a raw spinlock is forbidden in -rt kernels, which is exactly what __call_srcu() does, indirectly via srcu_funnel_gp_start(). This commit therefore downgrades Tree SRCU's locking from raw to non-raw spinlocks, which works because call_srcu() is not ever called while holding a raw spinlock. Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
e4d0b679 |
|
17-Sep-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Add parameters to SRCU docbook comments Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
27fdb35f |
|
19-Oct-2017 |
Paul E. McKenney <paulmck@kernel.org> |
doc: Fix various RCU docbook comment-header problems Because many of RCU's files have not been included into docbook, a number of errors have accumulated. This commit fixes them. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
35732cf9 |
|
05-Jul-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Provide ordering for CPU not involved in grace period Tree RCU guarantees that every online CPU has a memory barrier between any given grace period and any of that CPU's RCU read-side sections that must be ordered against that grace period. Since RCU doesn't always know where read-side critical sections are, the actual implementation guarantees order against prior and subsequent non-idle non-offline code, whether in an RCU read-side critical section or not. As a result, there does not need to be a memory barrier at the end of synchronize_rcu() and friends because the ordering internal to the grace period has ordered every CPU's post-grace-period execution against each CPU's pre-grace-period execution, again for all non-idle online CPUs. In contrast, SRCU can have non-idle online CPUs that are completely uninvolved in a given SRCU grace period, for example, a CPU that never runs any SRCU read-side critical sections and took no part in the grace-period processing. It is in theory possible for a given synchronize_srcu()'s wakeup to be delivered to a CPU that was completely uninvolved in the prior SRCU grace period, which could mean that the code following that synchronize_srcu() would end up being unordered with respect to both the grace period and any pre-existing SRCU read-side critical sections. This commit therefore adds an smp_mb() to the end of __synchronize_srcu(), which prevents this scenario from occurring. Reported-by: Lance Roy <ldr709@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Lance Roy <ldr709@gmail.com> Cc: <stable@vger.kernel.org> # 4.12.x
|
#
ac3748c6 |
|
22-May-2017 |
Paul E. McKenney <paulmck@kernel.org> |
rcutorture: Print SRCU lock/unlock totals This commit adds printing of SRCU lock/unlock totals, which are just the sums of the per-CPU counts. Saves a bit of mental arithmetic. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
115a1a52 |
|
22-May-2017 |
Paul E. McKenney <paulmck@kernel.org> |
rcutorture: Move SRCU status printing to SRCU implementations This commit gets rid of some ugly #ifdefs in rcutorture.c by moving the SRCU status printing to the SRCU implementations. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
0d8a1e83 |
|
15-Jun-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Make process_srcu() be static The function process_srcu() is not invoked outside of srcutree.c, so this commit makes it static and drops the EXPORT_SYMBOL_GPL(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
a3883df3 |
|
09-May-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Use rnp->lock wrappers to replace explicit memory barriers This commit uses TREE RCU's rnp->lock wrappers to replace a few explicit memory barriers. This change also has the advantage of making SRCU's memory-ordering properties be implemented in roughly the same way as they are in Tree RCU. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
5a0465e1 |
|
04-May-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Shrink srcu.h by moving docbook and private function The call_srcu() docbook entry is currently in include/linux/srcu.h, which causes needless processing for each include point. This commit therefore moves this entry to kernel/rcu/srcutree.c, which the compiler reads only once. In addition, the srcu_batches_completed() function is used only within RCU and its torture-test suites. This commit therefore also moves this function's declaration from include/linux/srcutiny.h, include/linux/srcutree.h, and include/linux/srcuclassic.h to kernel/rcu/rcu.h. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
c350c008 |
|
03-May-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Prevent sdp->srcu_gp_seq_needed counter wrap If a given CPU never happens to ever start an SRCU grace period, the grace-period sequence counter might wrap. If this CPU were to decide to finally start a grace period, the state of its sdp->srcu_gp_seq_needed might make it appear that it has already requested this grace period, which would prevent starting the grace period. If no other CPU ever started a grace period again, this would look like a grace-period hang. Even if some other CPU took pity and started the needed grace period, the leaf srcu_node structure's ->srcu_data_have_cbs field won't have a record of the fact that this CPU has a callback pending, which would look like a very localized grace-period hang. This might seem very unlikely, but SRCU grace periods can take less than a microsecond on small systems, which means that overflow can happen in much less than an hour on a 32-bit embedded system. And embedded systems are especially likely to have long-term idle CPUs. Therefore, it makes sense to prevent this scenario from happening. This commit therefore scans each srcu_data structure occasionally, with frequency controlled by the srcutree.counter_wrap_check kernel boot parameter. This parameter can be set to something like 255 in order to exercise the counter-wrap-prevention code. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
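The wrap scenario described in c350c008 can be illustrated with a small userspace sketch. It assumes a wrap-tolerant comparison in the style of the kernel's ULONG_CMP_GE() and ULONG_CMP_LT() macros; the helper names `looks_already_requested()` and `refresh_needed()` are hypothetical and are not taken from srcutree.c, and the refresh policy shown is a simplification rather than the commit's actual scan.

```c
#include <limits.h>
#include <stdbool.h>

/* Wrap-tolerant "a >= b", written in the style of ULONG_CMP_GE(). */
static bool seq_cmp_ge(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 >= a - b;
}

/* Wrap-tolerant "a < b", written in the style of ULONG_CMP_LT(). */
static bool seq_cmp_lt(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 < a - b;
}

/*
 * Hypothetical helper: does a CPU whose last recorded request is
 * "needed" appear to have already requested grace period "want"?
 * Once the global counter has advanced by more than half the counter
 * space, a stale "needed" value starts to look like the future.
 */
static bool looks_already_requested(unsigned long needed, unsigned long want)
{
	return seq_cmp_ge(needed, want);
}

/*
 * Hypothetical periodic fix-up: if the per-CPU value has fallen behind
 * the current grace-period sequence number, pull it forward so that it
 * can never drift half a counter space away.
 */
static unsigned long refresh_needed(unsigned long needed, unsigned long gp_seq)
{
	return seq_cmp_lt(needed, gp_seq) ? gp_seq : needed;
}
```

With a stale value of 8, a request for sequence number 12 is correctly seen as new, whereas a request more than ULONG_MAX/2 ahead is wrongly treated as already made. Refreshing the stale value from time to time keeps the two numbers close enough for the comparison to stay meaningful.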
#
a602538e |
|
28-Apr-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Add DEBUG_OBJECTS_RCU_HEAD functionality This commit adds DEBUG_OBJECTS_RCU_HEAD checking to detect call_srcu() counterparts to double-free bugs. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
0c8e0e3c |
|
28-Apr-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Print non-default exp_holdoff values at boot time This commit makes srcu_bootup_announce() check for non-default values of the auto-expedite holdoff time exp_holdoff and print a message if so. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
b5815e6c |
|
28-Apr-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Make exp_holdoff module parameter be static Because exp_holdoff is not used outside of srcutree.c, it can be static. This commit therefore makes this change. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
1f4f6da1 |
|
21-Apr-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Make Classic and Tree SRCU announce themselves at bootup Currently, the only way to tell whether a given kernel is running Classic, Tiny, or Tree SRCU is to look at the .config file, which can easily be lost or associated with the wrong kernel. This commit therefore has Classic and Tree SRCU identify themselves at boot time. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
881ec9d2 |
|
12-Apr-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Eliminate possibility of destructive counter overflow Earlier versions of Tree SRCU were subject to a counter overflow bug that could theoretically result in too-short grace periods. This commit eliminates this problem by adding an update-side memory barrier. The short explanation is that if the updater sums the unlock counts too late to see a given __srcu_read_unlock() increment, that CPU's next __srcu_read_lock() must see the new value of ->srcu_idx, thus incrementing the other bank of counters. This eliminates the possibility of destructive counter overflow as long as the srcu_read_lock() nesting level does not exceed floor(ULONG_MAX/NR_CPUS/2), which should be an eminently reasonable nesting limit, especially on 64-bit systems. Reported-by: Lance Roy <ldr709@gmail.com> Suggested-by: Lance Roy <ldr709@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
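The nesting bound quoted in 881ec9d2, floor(ULONG_MAX/NR_CPUS/2), is easy to evaluate for concrete configurations. The sketch below is only an arithmetic illustration: the function name `srcu_nesting_limit()` is hypothetical, and the counter maximum is passed in explicitly so that a 32-bit `unsigned long` can be modeled on any host.

```c
/*
 * Hypothetical helper: evaluate floor(counter_max / nr_cpus / 2), where
 * counter_max stands in for ULONG_MAX and nr_cpus for NR_CPUS.
 * The caller must pass a nonzero nr_cpus.
 */
static unsigned long long srcu_nesting_limit(unsigned long long counter_max,
					     unsigned long long nr_cpus)
{
	return counter_max / nr_cpus / 2;
}
```

For a 32-bit counter (maximum 4294967295) and 4096 CPUs this gives 524287 nested srcu_read_lock() calls, which suggests why the commit message calls the limit eminently reasonable even before 64-bit systems are considered.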
#
cdf7abc4 |
|
31-May-2017 |
Paolo Bonzini <pbonzini@redhat.com> |
srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting down a guest running iperf on a VFIO assigned device. This happens because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt context, while a worker thread does the same inside kvm_set_irq(). If the interrupt happens while the worker thread is executing __srcu_read_lock(), updates to the Classic SRCU ->lock_count[] field or the Tree SRCU ->srcu_lock_count[] field can be lost. The docs say you are not supposed to call srcu_read_lock() and srcu_read_unlock() from irq context, but KVM interrupt injection happens from (host) interrupt context and it would be nice if SRCU supported the use case. KVM is using SRCU here not really for the "sleepable" part, but rather due to its IPI-free fast detection of grace periods. It is therefore not desirable to switch back to RCU, which would effectively revert commit 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING", 2014-01-16). However, the docs are overly conservative. You can have an SRCU instance that only has users in irq context, and you can mix process and irq context as long as process context users disable interrupts. In addition, __srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and Classic SRCU. For those two implementations, only srcu_read_lock() is unsafe. When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(), in commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments. Therefore it kept __this_cpu_inc(), with preempt_disable/enable in the caller. Tree SRCU however only does one increment, so on most architectures it is more efficient for __srcu_read_lock() to use this_cpu_inc(), and any performance differences appear to be down in the noise. Unlike Classic and Tree SRCU, Tiny SRCU does increments and decrements on a single variable. 
Therefore, as Peter Zijlstra pointed out, Tiny SRCU's implementation already supports mixed-context use of srcu_read_lock() and srcu_read_unlock(), at least as long as uses of srcu_read_lock() and srcu_read_unlock() in each handler are nested and paired properly. In other words, it is still illegal to (say) invoke srcu_read_lock() in an interrupt handler and to invoke the matching srcu_read_unlock() in a softirq handler. Therefore, the only change required for Tiny SRCU is to its comments. Fixes: 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING") Reported-by: Linu Cherian <linuc.decode@gmail.com> Suggested-by: Linu Cherian <linuc.decode@gmail.com> Cc: kvm@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
45753c5f |
|
02-May-2017 |
Ingo Molnar <mingo@kernel.org> |
srcu: Debloat the <linux/rcu_segcblist.h> header Linus noticed that the <linux/rcu_segcblist.h> has huge inline functions which should not be inline at all. As a first step in cleaning this up, move them all to kernel/rcu/ and only keep an absolute minimum of data type defines in the header: before: -rw-r--r-- 1 mingo mingo 22284 May 2 10:25 include/linux/rcu_segcblist.h after: -rw-r--r-- 1 mingo mingo 3180 May 2 10:22 include/linux/rcu_segcblist.h More can be done, such as uninlining the large functions, whose inlining is unjustified even if it's an RCU internal matter. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
#
b5fe223a |
|
27-Apr-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Adjust default auto-expediting holdoff The default value for the kernel boot parameter srcutree.exp_holdoff is 50 microseconds, which is too long for good Tree SRCU performance (compared to Classic SRCU) on the workloads tested by Mike Galbraith. This commit therefore sets the default value to 25 microseconds, which shows excellent results in Mike's testing. Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Mike Galbraith <efault@gmx.de>
|
#
22607d66 |
|
25-Apr-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Specify auto-expedite holdoff time On small systems, in the absence of readers, expedited SRCU grace periods can complete in less than a microsecond. This means that an eight-CPU system can have all CPUs doing synchronize_srcu() in a tight loop and almost always expedite. This might actually be desirable in some situations, but in general it is a good way to needlessly burn CPU cycles. And in those situations where it is desirable, your friend is the function synchronize_srcu_expedited(). For other situations, this commit adds a kernel parameter that specifies a holdoff between completing the last SRCU grace period and auto-expediting the next. If the next grace period starts before the holdoff expires, auto-expediting is disabled. The holdoff is 50 microseconds by default, and can be tuned to the desired number of nanoseconds. A value of zero disables auto-expediting. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Mike Galbraith <efault@gmx.de>
|
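The holdoff rule from 22607d66 can be sketched as a single predicate. This is a userspace illustration under stated assumptions: the function name `may_auto_expedite()` and its parameters are hypothetical, time is passed in as plain nanosecond values rather than read from a kernel clock, and the treatment of the exact boundary (elapsed time equal to the holdoff) is a guess rather than something the commit message specifies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Default from the commit message: 50 microseconds, in nanoseconds. */
#define EXP_HOLDOFF_DEFAULT_NS (50ULL * 1000ULL)

/*
 * Hypothetical predicate: may the next grace period be auto-expedited?
 *   now_ns          - time at which the next grace period is requested
 *   last_gp_end_ns  - time at which the previous grace period completed
 *   exp_holdoff_ns  - holdoff; zero disables auto-expediting entirely
 */
static bool may_auto_expedite(uint64_t now_ns, uint64_t last_gp_end_ns,
			      uint64_t exp_holdoff_ns)
{
	if (exp_holdoff_ns == 0)
		return false;	/* a value of zero disables auto-expediting */

	/* A request arriving before the holdoff expires is not expedited. */
	return now_ns - last_gp_end_ns >= exp_holdoff_ns;
}
```

Under this sketch, a tight synchronize_srcu() loop that issues requests every few microseconds falls inside the holdoff and is handled as a normal grace period, while an isolated request arriving well after the previous grace period can still be expedited.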
#
2da4b2a7 |
|
25-Apr-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Expedite first synchronize_srcu() when idle Classic SRCU in effect expedites the first synchronize_srcu() when SRCU is idle, and Mike Galbraith demonstrated that some use cases do in fact rely on this behavior. In particular, Mike showed that Steven Rostedt's hotplug stress script takes 55 seconds with Classic SRCU and more than 16 -minutes- when running Tree SRCU. Assuming that each Tree SRCU's call to synchronize_srcu() takes four milliseconds, this implies that Steven's test invokes synchronize_srcu() in isolation, but more than once per 200 microseconds. Mike used ftrace to demonstrate that the time between successive calls to synchronize_srcu() ranged from 118 to 342 microseconds, with one outlier at 80 milliseconds. This data clearly indicates that Tree SRCU needs to expedite the first invocation of synchronize_srcu() during an SRCU idle period. This commit therefore introduces a srcu_might_be_idle() function that probabilistically checks whether or not SRCU is idle. This function is used by synchronize_srcu() as an additional criterion in deciding whether or not to expedite. (Hat trick to Peter Zijlstra for his earlier suggestion that this might in fact be a problem. Which for all I know might have motivated Mike to look into it.) Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Mike Galbraith <efault@gmx.de>
|
#
1e9a038b |
|
24-Apr-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Expedited grace periods with reduced memory contention Commit f60d231a87c5 ("srcu: Crude control of expedited grace periods") introduced a per-srcu_struct atomic counter to track outstanding requests for grace periods. This works, but represents a memory-contention bottleneck. This commit therefore uses the srcu_node combining tree to remove this bottleneck. This commit adds new ->srcu_gp_seq_needed_exp fields to the srcu_data, srcu_node, and srcu_struct structures, which track the farthest-in-the-future grace period that must be expedited, which in turn requires that all nearer-term grace periods also be expedited. Requests for expediting start with the srcu_data structure, run up through the srcu_node tree, and end at the srcu_struct structure. Note that it may be necessary to expedite a grace period that just now started, and this is handled by a new srcu_funnel_exp_start() function, which is invoked when the grace period itself is already on its way, but when that grace period was not marked as expedited. A new srcu_get_delay() function returns zero if there is at least one expedited SRCU grace period in flight, or SRCU_INTERVAL otherwise. This function is used to calculate delays: Normal grace periods are allowed to extend in order to cover more requests with a given grace-period computation, which decreases per-request overhead. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Mike Galbraith <efault@gmx.de>
|
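The delay rule stated for srcu_get_delay() in 1e9a038b can be sketched as follows. This is a userspace approximation, not the kernel function: the name `get_delay_sketch()`, the placeholder value of `SRCU_INTERVAL_SKETCH`, and the choice to detect "an expedited grace period in flight" by comparing the current sequence number against the farthest expedited request are all assumptions made for illustration.

```c
#include <limits.h>

/* Placeholder standing in for SRCU_INTERVAL; the real value is not assumed. */
#define SRCU_INTERVAL_SKETCH 1UL

/*
 * Hypothetical sketch: return 0 when some expedited grace period is
 * still outstanding, otherwise the normal inter-grace-period delay.
 *   gp_seq            - current grace-period sequence number
 *   gp_seq_needed_exp - farthest-in-the-future grace period that must
 *                       be expedited
 * The comparison is wrap-tolerant, in the style of ULONG_CMP_LT().
 */
static unsigned long get_delay_sketch(unsigned long gp_seq,
				      unsigned long gp_seq_needed_exp)
{
	if (ULONG_MAX / 2 < gp_seq - gp_seq_needed_exp)
		return 0;	/* gp_seq is behind an expedited request */
	return SRCU_INTERVAL_SKETCH;
}
```

Returning a nonzero delay in the non-expedited case is what lets a normal grace period stretch to cover more requests, which is the per-request overhead reduction the commit message describes.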
#
7f6733c3 |
|
18-Apr-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Make rcutorture writer stalls print SRCU GP state In the past, SRCU was simple enough that there was little point in making the rcutorture writer stall messages print the SRCU grace-period number state. With the advent of Tree SRCU, this has changed. This commit therefore makes Classic, Tiny, and Tree SRCU report this state to rcutorture as needed. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Mike Galbraith <efault@gmx.de>
|
#
c7e88067 |
|
18-Apr-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Exact tracking of srcu_data structures containing callbacks The current Tree SRCU implementation schedules a workqueue for every srcu_data covered by a given leaf srcu_node structure having callbacks, even if only one of those srcu_data structures actually contains callbacks. This is clearly inefficient for workloads that don't feature callbacks everywhere all the time. This commit therefore adds an array of masks that are used by the leaf srcu_node structures to track exactly which srcu_data structures contain callbacks. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Mike Galbraith <efault@gmx.de>
|
#
0497b489 |
|
18-Apr-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Expedite srcu_schedule_cbs_snp() callback invocation Although Tree SRCU does reduce delays when there is at least one synchronize_srcu_expedited() invocation pending, srcu_schedule_cbs_snp() still waits for SRCU_INTERVAL before invoking callbacks. Since synchronize_srcu_expedited() now posts a callback and waits for that callback to do a wakeup, this destroys the expedited nature of synchronize_srcu_expedited(). This destruction became apparent to Marc Zyngier in the guise of a guest-OS bootup slowdown from five seconds to no fewer than forty seconds. This commit therefore invokes callbacks immediately at the end of the grace period when there is at least one synchronize_srcu_expedited() invocation pending. This brought Marc's guest-OS bootup times back into the realm of reason. Reported-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Marc Zyngier <marc.zyngier@arm.com>
|
#
da915ad5 |
|
05-Apr-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Parallelize callback handling

Peter Zijlstra proposed using SRCU to reduce mmap_sem contention [1,2], however, there are workloads that could result in a high volume of concurrent invocations of call_srcu(), which with current SRCU would result in excessive lock contention on the srcu_struct structure's ->queue_lock, which protects SRCU's callback lists. This commit therefore moves SRCU to per-CPU callback lists, thus greatly reducing contention.

Because a given SRCU instance no longer has a single centralized callback list, starting grace periods and invoking callbacks are both more complex than in the single-list Classic SRCU implementation. Starting grace periods and handling callbacks are now handled using an srcu_node tree that is in some ways similar to the rcu_node trees used by RCU-bh, RCU-preempt, and RCU-sched (for example, the srcu_node tree shape is controlled by exactly the same Kconfig options and boot parameters that control the shape of the rcu_node tree).

In addition, the old per-CPU srcu_array structure is now named srcu_data and contains an rcu_segcblist structure named ->srcu_cblist for its callbacks (and a spinlock to protect this). The srcu_struct gets an srcu_gp_seq that is used to associate callback segments with the corresponding completion-time grace-period number. These completion-time grace-period numbers are propagated up the srcu_node tree so that the grace-period workqueue handler can determine whether additional grace periods are needed on the one hand and where to look for callbacks that are ready to be invoked on the other.

The srcu_barrier() function must now wait on all instances of the per-CPU ->srcu_cblist. Because each ->srcu_cblist is protected by ->lock, srcu_barrier() can remotely add the needed callbacks. In theory, it could also remotely start grace periods, but in practice doing so is complex and racy. And interestingly enough, it is never necessary for srcu_barrier() to start a grace period because srcu_barrier() only enqueues a callback when a callback is already present--and it turns out that a grace period has to have already been started for this pre-existing callback. Furthermore, it is only the callback that srcu_barrier() needs to wait on, not any particular grace period. Therefore, a new rcu_segcblist_entrain() function enqueues the srcu_barrier() function's callback into the same segment occupied by the last pre-existing callback in the list. The special case where all the pre-existing callbacks are on a different list (because they are in the process of being invoked) is handled by enqueuing srcu_barrier()'s callback into the RCU_DONE_TAIL segment, relying on the done-callbacks check that takes place after all callbacks are invoked.

Note that the readers use the same algorithm as before. Note that there is a separate srcu_idx that tells the readers what counter to increment. This unfortunately cannot be combined with srcu_gp_seq because they need to be incremented at different times.

This commit introduces some ugly #ifdefs in rcutorture. These will go away when I feel good enough about Tree SRCU to ditch Classic SRCU.

Some crude performance comparisons, courtesy of a quickly hacked rcuperf asynchronous-grace-period capability:

        Callback Queuing Overhead
        -------------------------
        # CPUS    Classic SRCU    Tree SRCU
        ------    ------------    ---------
             2        0.349 us     0.342 us
            16        31.66 us     0.4 us
            41       ---------     0.417 us

The times are the 90th percentiles, a statistic that was chosen to reject the overheads of the occasional srcu_barrier() call needed to avoid OOMing the test machine. The rcuperf test hangs when running Classic SRCU at 41 CPUs, hence the line of dashes. Despite the hacks to both the rcuperf code and the statistics, this is a convincing demonstration of Tree SRCU's performance and scalability advantages.

[1] https://lwn.net/Articles/309030/
[2] https://patchwork.kernel.org/patch/5108281/

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix initialization if synchronize_srcu_expedited() called first. ]
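
The entrain rule described in this commit message (place the srcu_barrier() callback in the segment of the last pre-existing callback, or in RCU_DONE_TAIL when every pre-existing callback is already being invoked) can be modelled in a few lines. What follows is a simplified Python sketch under stated assumptions, not the kernel's C implementation: the class SegCbList, its index-based tails array, and the helpers mark_all_waiting(), mark_all_done(), extract_done() and segment_of() are hypothetical names introduced for illustration. Only the segment names and the behaviour of rcu_segcblist_entrain() are taken from the message.

```python
# Segment indices, named after the kernel's rcu_segcblist segments.
RCU_DONE_TAIL, RCU_WAIT_TAIL, RCU_NEXT_READY_TAIL, RCU_NEXT_TAIL = range(4)


class SegCbList:
    """Toy model of a segmented callback list (hypothetical, illustration only)."""

    def __init__(self):
        self.cbs = []            # callbacks in queue order
        self.tails = [0] * 4     # tails[i] = index one past the end of segment i
        self.length = 0          # also counts callbacks currently being invoked

    def enqueue(self, cb):
        """Ordinary enqueue: new callbacks always land in the last segment."""
        self.cbs.append(cb)
        self.tails[RCU_NEXT_TAIL] += 1
        self.length += 1

    def entrain(self, cb):
        """Queue cb behind the last pre-existing callback, in that callback's segment.

        Returns False (and queues nothing) when no callback is pending at all.
        """
        if self.length == 0:
            return False
        # Walk back to the last non-empty segment; stop at RCU_DONE_TAIL regardless.
        i = RCU_NEXT_TAIL
        while i > RCU_DONE_TAIL and self.tails[i] == self.tails[i - 1]:
            i -= 1
        self.cbs.insert(self.tails[i], cb)
        for j in range(i, RCU_NEXT_TAIL + 1):
            self.tails[j] += 1
        self.length += 1
        return True

    def mark_all_waiting(self):
        """Hypothetical helper: move every not-yet-done callback into the wait segment."""
        end = self.tails[RCU_NEXT_TAIL]
        self.tails[RCU_WAIT_TAIL] = self.tails[RCU_NEXT_READY_TAIL] = end

    def mark_all_done(self):
        """Hypothetical helper: pretend a grace period completed for every callback."""
        self.tails = [self.tails[RCU_NEXT_TAIL]] * 4

    def extract_done(self):
        """Remove done callbacks for invocation; length still counts them."""
        n = self.tails[RCU_DONE_TAIL]
        done = self.cbs[:n]
        del self.cbs[:n]
        self.tails = [t - n for t in self.tails]
        return done

    def segment_of(self, cb):
        """Return the index of the segment that currently holds cb."""
        pos = self.cbs.index(cb)
        for seg, end in enumerate(self.tails):
            if pos < end:
                return seg
        raise ValueError("callback is not in any segment")
```

In this model, entraining onto a list whose only callback sits in the wait segment puts the barrier callback in the wait segment as well, while entraining after extract_done() has emptied the list (with length still nonzero) puts it in RCU_DONE_TAIL. Entraining onto a list with no pending callbacks returns False, matching the statement that srcu_barrier() only enqueues a callback when one is already present.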
|
#
dad81a20 |
|
25-Mar-2017 |
Paul E. McKenney <paulmck@kernel.org> |
srcu: Introduce CLASSIC_SRCU Kconfig option

The TREE_SRCU rewrite is large and a bit on the non-simple side, so this commit helps reduce risk by allowing the old v4.11 SRCU algorithm to be selected using a new CLASSIC_SRCU Kconfig option that depends on RCU_EXPERT. The default is to use the new TREE_SRCU and TINY_SRCU algorithms, in order to help get these the testing that they need. However, if your users do not require the update-side scalability that is to be provided by TREE_SRCU, select RCU_EXPERT and then CLASSIC_SRCU to revert to the old classic SRCU algorithm.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|