History log of /linux-master/kernel/rcu/tree.c
Revision Date Author Comments
# 30ef0963 22-Feb-2024 Paul E. McKenney <paulmck@kernel.org>

rcu-tasks: Initialize callback lists at rcu_init() time

In order for RCU Tasks to reliably maintain per-CPU lists of exiting
tasks, those lists must be initialized before it is possible for tasks
to exit, especially given that the boot CPU is not necessarily CPU 0
(an example being, powerpc kexec() kernels). And at the time that
rcu_init_tasks_generic() is called, a task could potentially exit,
unconventional though that sort of thing might be.

This commit therefore moves the calls to cblist_init_generic() from
functions called from rcu_init_tasks_generic() to a new function named
tasks_cblist_init_generic() that is invoked from rcu_init().

This constituted a bug in a commit that never went to mainline, so
there is no need for any backporting to -stable.

Reported-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>


# 7f66f099 02-Dec-2023 Qais Yousef <qyousef@layalina.io>

rcu: Provide a boot time parameter to control lazy RCU

To allow more flexible arrangements while still provide a single kernel
for distros, provide a boot time parameter to enable/disable lazy RCU.

Specify:

rcutree.enable_rcu_lazy=[y|1|n|0]

Which also requires

rcu_nocbs=all

at boot time to enable/disable lazy RCU.

To disable it by default at build time when CONFIG_RCU_LAZY=y, the new
CONFIG_RCU_LAZY_DEFAULT_OFF can be used.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>


# 23da2ad6 12-Jan-2024 Frederic Weisbecker <frederic@kernel.org>

rcu/exp: Remove rcu_par_gp_wq

TREE04 running on short iterations can produce writer stalls of the
following kind:

??? Writer stall state RTWS_EXP_SYNC(4) g3968 f0x0 ->state 0x2 cpu 0
task:rcu_torture_wri state:D stack:14568 pid:83 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x2de/0x850
? trace_event_raw_event_rcu_exp_funnel_lock+0x6d/0xb0
schedule+0x4f/0x90
synchronize_rcu_expedited+0x430/0x670
? __pfx_autoremove_wake_function+0x10/0x10
? __pfx_synchronize_rcu_expedited+0x10/0x10
do_rtws_sync.constprop.0+0xde/0x230
rcu_torture_writer+0x4b4/0xcd0
? __pfx_rcu_torture_writer+0x10/0x10
kthread+0xc7/0xf0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2f/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>

Waiting for an expedited grace period and polling for an expedited
grace period both are operations that internally rely on the same
workqueue performing necessary asynchronous work.

However, a dependency chain is involved between those two operations,
as depicted below:

====== CPU 0 ======= ====== CPU 1 =======

synchronize_rcu_expedited()
exp_funnel_lock()
mutex_lock(&rcu_state.exp_mutex);
start_poll_synchronize_rcu_expedited
queue_work(rcu_gp_wq, &rnp->exp_poll_wq);
synchronize_rcu_expedited_queue_work()
queue_work(rcu_gp_wq, &rew->rew_work);
wait_event() // A, wait for &rew->rew_work completion
mutex_unlock() // B
//======> switch to kworker

sync_rcu_do_polled_gp() {
synchronize_rcu_expedited()
exp_funnel_lock()
mutex_lock(&rcu_state.exp_mutex); // C, wait B
....
} // D

Since workqueues are usually implemented on top of several kworkers
handling the queue concurrently, the above situation wouldn't deadlock
most of the time because A then doesn't depend on D. But in case of
memory stress, a single kworker may end up handling alone all the works
in a serialized way. In that case the above layout becomes a problem
because A then waits for D, closing a circular dependency:

A -> D -> C -> B -> A

This however only happens when CONFIG_RCU_EXP_KTHREAD=n. Indeed
synchronize_rcu_expedited() is otherwise implemented on top of a kthread
worker while polling still relies on rcu_gp_wq workqueue, breaking the
above circular dependency chain.

Fix this with making expedited grace period to always rely on kthread
worker. The workqueue based implementation is essentially a duplicate
anyway now that the per-node initialization is performed by per-node
kthread workers.

Meanwhile the CONFIG_RCU_EXP_KTHREAD switch is still kept around to
manage the scheduler policy of these kthread workers.

Reported-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Suggested-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>


# b67cffcb 12-Jan-2024 Frederic Weisbecker <frederic@kernel.org>

rcu/exp: Handle parallel exp gp kworkers affinity

Affine the parallel expedited gp kworkers to their respective RCU node
in order to make them close to the cache their are playing with.

This reuses the boost kthreads machinery that probe into CPU hotplug
operations such that the kthreads become/stay affine to their respective
node as soon/long as they contain online CPUs. Otherwise and if the
current CPU going down was the last online on the leaf node, the related
kthread is affine to the housekeeping CPUs.

In the long run, this affinity VS CPU hotplug operation game should
probably be implemented at the generic kthread level.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
[boqun: s/* rcu_boost_task/*rcu_boost_task as reported by checkpatch]
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>


# 8e5e6215 12-Jan-2024 Frederic Weisbecker <frederic@kernel.org>

rcu/exp: Make parallel exp gp kworker per rcu node

When CONFIG_RCU_EXP_KTHREAD=n, the expedited grace period per node
initialization is performed in parallel via workqueues (one work per
node).

However in CONFIG_RCU_EXP_KTHREAD=y, this per node initialization is
performed by a single kworker serializing each node initialization (one
work for all nodes).

The second part is certainly less scalable and efficient beyond a single
leaf node.

To improve this, expand this single kworker into per-node kworkers. This
new layout is eventually intended to remove the workqueues based
implementation since it will essentially now become duplicate code.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>


# c19e5d3b 12-Jan-2024 Frederic Weisbecker <frederic@kernel.org>

rcu/exp: Move expedited kthread worker creation functions above rcutree_prepare_cpu()

The expedited kthread worker performing the per node initialization is
going to be split into per node kthreads. As such, the future per node
kthread creation will need to be called from CPU hotplug callbacks
instead of an initcall, right beside the per node boost kthread
creation.

To prepare for that, move the kthread worker creation above
rcutree_prepare_cpu() as a first step to make the review smoother for
the upcoming modifications.

No intended functional change.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>


# 7836b270 12-Jan-2024 Frederic Weisbecker <frederic@kernel.org>

rcu: s/boost_kthread_mutex/kthread_mutex

This mutex is currently protecting per node boost kthreads creation and
affinity setting across CPU hotplug operations.

Since the expedited kworkers will soon be split per node as well, they
will be subject to the same concurrency constraints against hotplug.

Therefore their creation and affinity tuning operations will be grouped
with those of boost kthreads and then rely on the same mutex.

To prepare for that, generalize its name.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>


# e7539ffc 12-Jan-2024 Frederic Weisbecker <frederic@kernel.org>

rcu/exp: Handle RCU expedited grace period kworker allocation failure

Just like is done for the kworker performing nodes initialization,
gracefully handle the possible allocation failure of the RCU expedited
grace period main kworker.

While at it perform a rename of the related checking functions to better
reflect the expedited specifics.

Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 9621fbee44df ("rcu: Move expedited grace period (GP) work to RT kthread_worker")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>


# a636c5e6 12-Jan-2024 Frederic Weisbecker <frederic@kernel.org>

rcu/exp: Fix RCU expedited parallel grace period kworker allocation failure recovery

Under CONFIG_RCU_EXP_KTHREAD=y, the nodes initialization for expedited
grace periods is queued to a kworker. However if the allocation of that
kworker failed, the nodes initialization is performed synchronously by
the caller instead.

Now the check for kworker initialization failure relies on the kworker
pointer to be NULL while its value might actually encapsulate an
allocation failure error.

Make sure to handle this case.

Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 9621fbee44df ("rcu: Move expedited grace period (GP) work to RT kthread_worker")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>


# afd4e696 09-Jan-2024 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Re-arrange call_rcu() NOCB specific code

Currently the call_rcu() function interleaves NOCB and !NOCB enqueue
code in a complicated way such that:

* The bypass enqueue code may or may not have enqueued and may or may
not have locked the ->nocb_lock. Everything that follows is in a
Schrödinger locking state for the unwary reviewer's eyes.

* The was_alldone is always set but only used in NOCB related code.

* The NOCB wake up is distantly related to the locking hopefully
performed by the bypass enqueue code that did not enqueue on the
bypass list.

Unconfuse the whole and gather NOCB and !NOCB specific enqueue code to
their own functions.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>


# b913c3fe 09-Jan-2024 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Make IRQs disablement symmetric

Currently IRQs are disabled on call_rcu() and then depending on the
context:

* If the CPU is in nocb mode:

- If the callback is enqueued in the bypass list, IRQs are re-enabled
implictly by rcu_nocb_try_bypass()

- If the callback is enqueued in the normal list, IRQs are re-enabled
implicitly by __call_rcu_nocb_wake()

* If the CPU is NOT in nocb mode, IRQs are reenabled explicitly from call_rcu()

This makes the code a bit hard to follow, especially as it interleaves
with nocb locking.

To make the IRQ flags coverage clearer and also in order to prepare for
moving all the nocb enqueue code to its own function, always re-enable
the IRQ flags explicitly from call_rcu().

Reviewed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>


# 1e8e6951 15-Nov-2023 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Remove needless full barrier after callback advancing

A full barrier is issued from nocb_gp_wait() upon callbacks advancing
to order grace period completion with callbacks execution.

However these two events are already ordered by the
smp_mb__after_unlock_lock() barrier within the call to
raw_spin_lock_rcu_node() that is necessary for callbacks advancing to
happen.

The following litmus test shows the kind of guarantee that this barrier
provides:

C smp_mb__after_unlock_lock

{}

// rcu_gp_cleanup()
P0(spinlock_t *rnp_lock, int *gpnum)
{
// Grace period cleanup increase gp sequence number
spin_lock(rnp_lock);
WRITE_ONCE(*gpnum, 1);
spin_unlock(rnp_lock);
}

// nocb_gp_wait()
P1(spinlock_t *rnp_lock, spinlock_t *nocb_lock, int *gpnum, int *cb_ready)
{
int r1;

// Call rcu_advance_cbs() from nocb_gp_wait()
spin_lock(nocb_lock);
spin_lock(rnp_lock);
smp_mb__after_unlock_lock();
r1 = READ_ONCE(*gpnum);
WRITE_ONCE(*cb_ready, 1);
spin_unlock(rnp_lock);
spin_unlock(nocb_lock);
}

// nocb_cb_wait()
P2(spinlock_t *nocb_lock, int *cb_ready, int *cb_executed)
{
int r2;

// rcu_do_batch() -> rcu_segcblist_extract_done_cbs()
spin_lock(nocb_lock);
r2 = READ_ONCE(*cb_ready);
spin_unlock(nocb_lock);

// Actual callback execution
WRITE_ONCE(*cb_executed, 1);
}

P3(int *cb_executed, int *gpnum)
{
int r3;

WRITE_ONCE(*cb_executed, 2);
smp_mb();
r3 = READ_ONCE(*gpnum);
}

exists (1:r1=1 /\ 2:r2=1 /\ cb_executed=2 /\ 3:r3=0) (* Bad outcome. *)

Here the bad outcome only occurs if the smp_mb__after_unlock_lock() is
removed. This barrier orders the grace period completion against
callbacks advancing and even later callbacks invocation, thanks to the
opportunistic propagation via the ->nocb_lock to nocb_cb_wait().

Therefore the smp_mb() placed after callbacks advancing can be safely
removed.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>


# e787644c 18-Dec-2023 Frederic Weisbecker <frederic@kernel.org>

rcu: Defer RCU kthreads wakeup when CPU is dying

When the CPU goes idle for the last time during the CPU down hotplug
process, RCU reports a final quiescent state for the current CPU. If
this quiescent state propagates up to the top, some tasks may then be
woken up to complete the grace period: the main grace period kthread
and/or the expedited main workqueue (or kworker).

If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
arm the RT bandwith timer to the local offline CPU. Since this happens
after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
timer gets ignored. Therefore if the RCU kthreads are waiting for RT
bandwidth to be available, they may never be actually scheduled.

This triggers TREE03 rcutorture hangs:

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
rcu: (t=21035 jiffies g=938281 q=40787 ncpus=6)
rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt state:R running task stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x2eb/0xa80
schedule+0x1f/0x90
schedule_timeout+0x163/0x270
? __pfx_process_timeout+0x10/0x10
rcu_gp_fqs_loop+0x37c/0x5b0
? __pfx_rcu_gp_kthread+0x10/0x10
rcu_gp_kthread+0x17c/0x200
kthread+0xde/0x110
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2b/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>

The situation can't be solved with just unpinning the timer. The hrtimer
infrastructure and the nohz heuristics involved in finding the best
remote target for an unpinned timer would then also need to handle
enqueues from an offline CPU in the most horrendous way.

So fix this on the RCU side instead and defer the wake up to an online
CPU if it's too late for the local one.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>


# dee39c0c 31-Oct-2023 Zqiang <qiang.zhang1211@gmail.com>

rcu: Force quiescent states only for ongoing grace period

If an rcutorture test scenario creates an fqs_task kthread, it will
periodically invoke rcu_force_quiescent_state() in order to start
force-quiescent-state (FQS) operations. However, an FQS operation
will be started even if there is no RCU grace period in progress.
Although testing FQS operations startup when there is no grace period in
progress is necessary, it need not happen all that often. This commit
therefore causes rcu_force_quiescent_state() to take an early exit
if there is no grace period in progress.

Note that there will still be attempts to start an FQS scan in the
absence of a grace period because the grace period might end right
after the rcu_force_quiescent_state() function's check. In actual
testing, this happens about once every ten minutes, which should
provide adequate testing.

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>


# 2be4686d 27-Oct-2023 Frederic Weisbecker <frederic@kernel.org>

rcu: Introduce rcu_cpu_online()

Export the RCU point of view as to when a CPU is considered offline
(ie: when does RCU consider that a CPU is sufficiently down in the
hotplug process to not feature any possible read side).

This will be used by RCU-tasks whose vision of an offline CPU should
reasonably match the one of RCU core.

Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# 85d68222 31-Oct-2023 Peter Zijlstra <peterz@infradead.org>

rcu: Break rcu_node_0 --> &rq->__lock order

Commit 851a723e45d1 ("sched: Always clear user_cpus_ptr in
do_set_cpus_allowed()") added a kfree() call to free any user
provided affinity mask, if present. It was changed later to use
kfree_rcu() in commit 9a5418bc48ba ("sched/core: Use kfree_rcu()
in do_set_cpus_allowed()") to avoid a circular locking dependency
problem.

It turns out that even kfree_rcu() isn't safe for avoiding
circular locking problem. As reported by kernel test robot,
the following circular locking dependency now exists:

&rdp->nocb_lock --> rcu_node_0 --> &rq->__lock

Solve this by breaking the rcu_node_0 --> &rq->__lock chain by moving
the resched_cpu() out from under rcu_node lock.

[peterz: heavily borrowed from Waiman's Changelog]
[paulmck: applied Z qiang feedback]

Fixes: 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/oe-lkp/202310302207.a25f1a30-oliver.sang@intel.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# 21e0b932 11-Sep-2023 Qi Zheng <zhengqi.arch@bytedance.com>

rcu: dynamically allocate the rcu-kfree shrinker

Use new APIs to dynamically allocate the rcu-kfree shrinker.

Link: https://lkml.kernel.org/r/20230911094444.68966-17-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Coly Li <colyli@suse.de>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 448e9f34 08-Sep-2023 Frederic Weisbecker <frederic@kernel.org>

rcu: Standardize explicit CPU-hotplug calls

rcu_report_dead() and rcutree_migrate_callbacks() have their headers in
rcupdate.h while those are pure rcutree calls, like the other CPU-hotplug
functions.

Also rcu_cpu_starting() and rcu_report_dead() have different naming
conventions while they mirror each other's effects.

Fix the headers and propose a naming that relates both functions and
aligns with the prefix of other rcutree CPU-hotplug functions.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# 2cb1f6e9 08-Sep-2023 Frederic Weisbecker <frederic@kernel.org>

rcu: Conditionally build CPU-hotplug teardown callbacks

Among the three CPU-hotplug teardown RCU callbacks, two of them early
exit if CONFIG_HOTPLUG_CPU=n, and one is left unchanged. In any case
all of them have an implementation when CONFIG_HOTPLUG_CPU=n.

Align instead with the common way to deal with CPU-hotplug teardown
callbacks and provide a proper stub when they are not supported.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# c964c1f5 08-Sep-2023 Frederic Weisbecker <frederic@kernel.org>

rcu: Assume rcu_report_dead() is always called locally

rcu_report_dead() has to be called locally by the CPU that is going to
exit the RCU state machine. Passing a cpu argument here is error-prone
and leaves the possibility for a racy remote call.

Use local access instead.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# 358662a9 08-Sep-2023 Frederic Weisbecker <frederic@kernel.org>

rcu: Assume IRQS disabled from rcu_report_dead()

rcu_report_dead() is the last RCU word from the CPU down through the
hotplug path. It is called in the idle loop right before the CPU shuts
down for good. Because it removes the CPU from the grace period state
machine and reports an ultimate quiescent state if necessary, no further
use of RCU is allowed. Therefore it is expected that IRQs are disabled
upon calling this function and are not to be re-enabled again until the
CPU shuts down.

Remove the IRQs disablement from that function and verify instead that
it is actually called with IRQs disabled as it is expected at that
special point in the idle path.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# 5f98fd03 30-Sep-2023 Catalin Marinas <catalin.marinas@arm.com>

rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects

Since the actual slab freeing is deferred when calling kvfree_rcu(), so
is the kmemleak_free() callback informing kmemleak of the object
deletion. From the perspective of the kvfree_rcu() caller, the object is
freed and it may remove any references to it. Since kmemleak does not
scan RCU internal data storing the pointer, it will report such objects
as leaks during the grace period.

Tell kmemleak to ignore such objects on the kvfree_call_rcu() path. Note
that the tiny RCU implementation does not have such issue since the
objects can be tracked from the rcu_ctrlblk structure.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://lore.kernel.org/all/F903A825-F05F-4B77-A2B5-7356282FBA2C@apple.com/
Cc: <stable@vger.kernel.org>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# 0ae9942f 18-Aug-2023 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate rcu_gp_slow_unregister() false positive

When using rcutorture as a module, there are a number of conditions that
can abort the modprobe operation, for example, when attempting to run
both RCU CPU stall warning tests and forward-progress tests. This can
cause rcu_torture_cleanup() to be invoked on the unwind path out of
rcu_rcu_torture_init(), which will mean that rcu_gp_slow_unregister()
is invoked without a matching rcu_gp_slow_register(). This will cause
a splat because rcu_gp_slow_unregister() is passed rcu_fwd_cb_nodelay,
which does not match a NULL pointer.

This commit therefore forgives a mismatch involving a NULL pointer, thus
avoiding this false-positive splat.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# 2cbc482d 04-Aug-2023 Zhen Lei <thunder.leizhen@huawei.com>

rcu: Dump memory object info if callback function is invalid

When a structure containing an RCU callback rhp is (incorrectly) freed
and reallocated after rhp is passed to call_rcu(), it is not unusual for
rhp->func to be set to NULL. This defeats the debugging prints used by
__call_rcu_common() in kernels built with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y,
which expect to identify the offending code using the identity of this
function.

And in kernels build without CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, things
are even worse, as can be seen from this splat:

Unable to handle kernel NULL pointer dereference at virtual address 0
... ...
PC is at 0x0
LR is at rcu_do_batch+0x1c0/0x3b8
... ...
(rcu_do_batch) from (rcu_core+0x1d4/0x284)
(rcu_core) from (__do_softirq+0x24c/0x344)
(__do_softirq) from (__irq_exit_rcu+0x64/0x108)
(__irq_exit_rcu) from (irq_exit+0x8/0x10)
(irq_exit) from (__handle_domain_irq+0x74/0x9c)
(__handle_domain_irq) from (gic_handle_irq+0x8c/0x98)
(gic_handle_irq) from (__irq_svc+0x5c/0x94)
(__irq_svc) from (arch_cpu_idle+0x20/0x3c)
(arch_cpu_idle) from (default_idle_call+0x4c/0x78)
(default_idle_call) from (do_idle+0xf8/0x150)
(do_idle) from (cpu_startup_entry+0x18/0x20)
(cpu_startup_entry) from (0xc01530)

This commit therefore adds calls to mem_dump_obj(rhp) to output some
information, for example:

slab kmalloc-256 start ffff410c45019900 pointer offset 0 size 256

This provides the rough size of the memory block and the offset of the
rcu_head structure, which as least provides at least a few clues to help
locate the problem. If the problem is reproducible, additional slab
debugging can be enabled, for example, CONFIG_DEBUG_SLAB=y, which can
provide significantly more information.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# 16128b1f 01-Aug-2023 Paul E. McKenney <paulmck@kernel.org>

rcu: Add sysfs to provide throttled access to rcu_barrier()

When running a series of stress tests all making heavy use of RCU,
it is all too possible to OOM the system when the prior test's RCU
callbacks don't get invoked until after the subsequent test starts.
One way of handling this is just a timed wait, but this fails when a
given CPU has so many callbacks queued that they take longer to invoke
than allowed for by that timed wait.

This commit therefore adds an rcutree.do_rcu_barrier module parameter that
is accessible from sysfs. Writing one of the many synonyms for boolean
"true" will cause an rcu_barrier() to be invoked, but will guarantee that
no more than one rcu_barrier() will be invoked per sixteenth of a second
via this mechanism. The flip side is that a given request might wait a
second or three longer than absolutely necessary, but only when there are
multiple uses of rcutree.do_rcu_barrier within a one-second time interval.

This commit unnecessarily serializes the rcu_barrier() machinery, given
that serialization is already provided by procfs. This has the advantage
of allowing throttled rcu_barrier() from other sources within the kernel.

Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# 4502138a 29-Jul-2023 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/tree: Remove superfluous return from void call_rcu* functions

The return keyword is not needed here.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# b96e7a5f 04-Sep-2023 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/tree: Defer setting of jiffies during stall reset

There are instances where rcu_cpu_stall_reset() is called when jiffies
did not get a chance to update for a long time. Before jiffies is
updated, the CPU stall detector can go off triggering false-positives
where a just-started grace period appears to be ages old. In the past,
we disabled stall detection in rcu_cpu_stall_reset() however this got
changed [1]. This is resulting in false-positives in KGDB usecase [2].

Fix this by deferring the update of jiffies to the third run of the FQS
loop. This is more robust, as, even if rcu_cpu_stall_reset() is called
just before jiffies is read, we would end up pushing out the jiffies
read by 3 more FQS loops. Meanwhile the CPU stall detection will be
delayed and we will not get any false positives.

[1] https://lore.kernel.org/all/20210521155624.174524-2-senozhatsky@chromium.org/
[2] https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/

Tested with rcutorture.cpu_stall option as well to verify stall behavior
with/without patch.

Tested-by: Huacai Chen <chenhuacai@loongson.cn>
Reported-by: Binbin Zhou <zhoubinbin@loongson.cn>
Closes: https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/
Suggested-by: Paul McKenney <paulmck@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: a80be428fbc1 ("rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# 343640cb 26-Jun-2023 Paul E. McKenney <paulmck@kernel.org>

rcu: Mark __rcu_irq_enter_check_tick() ->rcu_urgent_qs load

The rcu_request_urgent_qs_task() function does a cross-CPU store
to ->rcu_urgent_qs, so this commit therefore marks the load in
__rcu_irq_enter_check_tick() with READ_ONCE().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>


# c924bf5a 23-May-2023 Paul E. McKenney <paulmck@kernel.org>

rcu: Clarify rcu_is_watching() kernel-doc comment

Make it clear that this function always returns either true or false
without other planned failure modes.

Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>


# 401b0de3 26-Apr-2023 Paul E. McKenney <paulmck@kernel.org>

rcu-tasks: Stop rcu_tasks_invoke_cbs() from using never-onlined CPUs

The rcu_tasks_invoke_cbs() function relies on queue_work_on() to silently
fall back to WORK_CPU_UNBOUND when the specified CPU is offline. However,
the queue_work_on() function's silent fallback mechanism relies on that
CPU having been online at some time in the past. When queue_work_on()
is passed a CPU that has never been online, workqueue lockups ensue,
which can be bad for your kernel's general health and well-being.

This commit therefore checks whether a given CPU has ever been online,
and, if not substitutes WORK_CPU_UNBOUND in the subsequent call to
queue_work_on(). Why not simply omit the queue_work_on() call entirely?
Because this function is flooding callback-invocation notifications
to all CPUs, and must deal with possibilities that include a sparse
cpu_possible_mask.

This commit also moves the setting of the rcu_data structure's
->beenonline field to rcu_cpu_starting(), which executes on the
incoming CPU before that CPU has ever enabled interrupts. This ensures
that the required workqueues are present. In addition, because the
incoming CPU has not yet enabled its interrupts, there cannot yet have
been any softirq handlers running on this CPU, which means that the
WARN_ON_ONCE(!rdp->beenonline) within the RCU_SOFTIRQ handler cannot
have triggered yet.

Fixes: d363f833c6d88 ("rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 15d44dfa 27-Apr-2023 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_cpu_starting() rely on interrupts being disabled

Currently, rcu_cpu_starting() is written so that it might be invoked
with interrupts enabled. However, it is always called when interrupts
are disabled, either by rcu_init(), notify_cpu_starting(), or from a
call point prior to the call to notify_cpu_starting().

But why bother requiring that interrupts be disabled? The purpose is
to allow the rcu_data structure's ->beenonline flag to be set after all
early processing has completed for the incoming CPU, thus allowing this
flag to be used to determine when workqueues have been set up for the
incoming CPU, while still allowing this flag to be used as a diagnostic
within rcu_core().

This commit therefore makes rcu_cpu_starting() rely on interrupts being
disabled.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a24c1aab 07-Apr-2023 Paul E. McKenney <paulmck@kernel.org>

rcu: Mark rcu_cpu_kthread() accesses to ->rcu_cpu_has_work

The rcu_data structure's ->rcu_cpu_has_work field can be modified by
any CPU attempting to wake up the rcuc kthread. Therefore, this commit
marks accesses to this field from the rcu_cpu_kthread() function.

This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# f51164a8 31-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

rcu: Employ jiffies-based backstop to callback time limit

Currently, if there are more than 100 ready-to-invoke RCU callbacks queued
on a given CPU, the rcu_do_batch() function sets a timeout for invocation
of the series. This timeout defaulting to three milliseconds, and may
be adjusted using the rcutree.rcu_resched_ns kernel boot parameter.
This timeout is checked using local_clock(), but the overhead of this
function combined with the common-case very small callback-invocation
overhead means that local_clock() is checked every 32nd invocation.

This works well except for longer-than average callbacks. For example,
a series of 500-microsecond-duration callbacks means that local_clock()
is checked only once every 16 milliseconds, which makes it difficult to
enforce a three-millisecond timeout.

This commit therefore adds a Kconfig option RCU_DOUBLE_CHECK_CB_TIME
that enables backup timeout checking using the coarser grained but
lighter weight jiffies. If the jiffies counter detects a timeout,
then local_clock() is consulted even if this is not the 32nd callback.
This prevents the aforementioned 16-millisecond latency blow.

Reported-by: Domas Mituzas <dmituzas@meta.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# fea1c1f0 21-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

rcu: Check callback-invocation time limit for rcuc kthreads

Currently, a callback-invocation time limit is enforced only for
callbacks invoked from the softirq environment, the rationale being
that when callbacks are instead invoked from rcuc and rcuoc kthreads,
these callbacks cannot be holding up other softirq vectors.

Which is in fact true. However, if an rcuc kthread spends too much time
invoking callbacks, it can delay quiescent-state reports from its CPU,
which can also be a problem.

This commit therefore applies the callback-invocation time limit to
callback invocation from the rcuc kthreads as well as from softirq.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 6b706e56 18-Apr-2023 Zqiang <qiang1.zhang@intel.com>

rcu/kvfree: Make drain_page_cache() take early return if cache is disabled

If the rcutree.rcu_min_cached_objs kernel boot parameter is set to zero,
then krcp->page_cache_work will never be triggered to fill page cache.
In addition, the put_cached_bnode() will not fill page cache. As a
result krcp->bkvcache will always be empty, so there is no need to acquire
krcp->lock to get page from krcp->bkvcache. This commit therefore makes
drain_page_cache() return immediately if the rcu_min_cached_objs is zero.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 60888b77 12-Apr-2023 Zqiang <qiang1.zhang@intel.com>

rcu/kvfree: Make fill page cache start from krcp->nr_bkv_objs

When the fill_page_cache_func() function is invoked, it assumes that
the cache of pages is completely empty. However, there can be some time
between triggering execution of this function and its actual invocation.
During this time, kfree_rcu_work() might run, and might fill in part or
all of this cache of pages, thus invalidating the fill_page_cache_func()
function's assumption.

This will not overfill the cache because put_cached_bnode() will reject
the extra page. However, it will result in a needless allocation and
freeing of one extra page, which might not be helpful under lowish-memory
conditions.

This commit therefore causes the fill_page_cache_func() to explicitly
account for pages that have been placed into the cache shortly before
it starts running.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 021a5ff8 11-Apr-2023 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/kvfree: Do not run a page work if a cache is disabled

By default the cache size is 5 pages per CPU, but it can be disabled at
boot time by setting the rcu_min_cached_objs to zero. When that happens,
the current code will uselessly set an hrtimer to schedule refilling this
cache with zero pages. This commit therefore streamlines this process
by simply refusing the set the hrtimer when rcu_min_cached_objs is zero.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 309a4316 08-Apr-2023 Zqiang <qiang1.zhang@intel.com>

rcu/kvfree: Use consistent krcp when growing kfree_rcu() page cache

The add_ptr_to_bulk_krc_lock() function is invoked to allocate a new
kfree_rcu() page, also known as a kvfree_rcu_bulk_data structure.
The kfree_rcu_cpu structure's lock is used to protect this operation,
except that this lock must be momentarily dropped when allocating memory.
It is clearly important that the lock that is reacquired be the same
lock that was acquired initially via krc_this_cpu_lock().

Unfortunately, this same krc_this_cpu_lock() function is used to
re-acquire this lock, and if the task migrated to some other CPU during
the memory allocation, this will result in the kvfree_rcu_bulk_data
structure being added to the wrong CPU's kfree_rcu_cpu structure.

This commit therefore replaces that second call to krc_this_cpu_lock()
with raw_spin_lock_irqsave() in order to explicitly acquire the lock on
the correct kfree_rcu_cpu structure, thus keeping things straight even
when the task migrates.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 1e237994 04-Apr-2023 Zqiang <qiang1.zhang@intel.com>

rcu/kvfree: Invoke debug_rcu_bhead_unqueue() after checking bnode->gp_snap

If kvfree_rcu_bulk() sees that the required grace period has failed to
elapse, it leaks the memory because readers might still be using it.
But in that case, the debug-objects subsystem still marks the relevant
structures as having been freed, even though they are instead being
leaked.

This commit fixes this mismatch by invoking debug_rcu_bhead_unqueue()
only when we are actually going to free the objects.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# f32276a3 04-Apr-2023 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/kvfree: Add debug check for GP complete for kfree_rcu_cpu list

Under low-memory conditions, kvfree_rcu() will use each object's
rcu_head structure to queue objects in a singly linked list headed by
the kfree_rcu_cpu structure's ->head field. This list is passed to
call_rcu() as a unit, but there is no indication of which grace period
this list needs to wait for. This in turn prevents adding debug checks
in the kfree_rcu_work() as was done for the two page-of-pointers channels
in the kfree_rcu_cpu structure.

This commit therefore adds a ->head_free_gp_snap field to the
kfree_rcu_cpu_work structure to record this grace-period number. It also
adds a WARN_ON_ONCE() to kfree_rcu_monitor() that checks to make sure
that the required grace period has in fact elapsed.

[ paulmck: Fix kerneldoc issue raised by Stephen Rothwell. ]

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# cdfa0f6f 03-Apr-2023 Paul E. McKenney <paulmck@kernel.org>

rcu/kvfree: Add debug to check grace periods

This commit adds debugging checks to verify that the required RCU
grace period has elapsed for each kvfree_rcu_bulk_data structure that
arrives at the kvfree_rcu_bulk() function. These checks make use
of that structure's ->gp_snap field, which has been upgraded from an
unsigned long to an rcu_gp_oldstate structure. This upgrade reduces
the chances of false positives to nearly zero, even on 32-bit systems,
for which this structure carries 64 bits of state.

Cc: Ziwei Dai <ziwei.dai@unisoc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 7a29fb4a4 06-Jan-2023 Zheng Yejian <zhengyejian1@huawei.com>

rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed

Registering a kprobe on __rcu_irq_enter_check_tick() can cause kernel
stack overflow as shown below. This issue can be reproduced by enabling
CONFIG_NO_HZ_FULL and booting the kernel with argument "nohz_full=",
and then giving the following commands at the shell prompt:

# cd /sys/kernel/tracing/
# echo 'p:mp1 __rcu_irq_enter_check_tick' >> kprobe_events
# echo 1 > events/kprobes/enable

This commit therefore adds __rcu_irq_enter_check_tick() to the kprobes
blacklist using NOKPROBE_SYMBOL().

Insufficient stack space to handle exception!
ESR: 0x00000000f2000004 -- BRK (AArch64)
FAR: 0x0000ffffccf3e510
Task stack: [0xffff80000ad30000..0xffff80000ad38000]
IRQ stack: [0xffff800008050000..0xffff800008058000]
Overflow stack: [0xffff089c36f9f310..0xffff089c36fa0310]
CPU: 5 PID: 190 Comm: bash Not tainted 6.2.0-rc2-00320-g1f5abbd77e2c #19
Hardware name: linux,dummy-virt (DT)
pstate: 400003c5 (nZcv DAIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __rcu_irq_enter_check_tick+0x0/0x1b8
lr : ct_nmi_enter+0x11c/0x138
sp : ffff80000ad30080
x29: ffff80000ad30080 x28: ffff089c82e20000 x27: 0000000000000000
x26: 0000000000000000 x25: ffff089c02a8d100 x24: 0000000000000000
x23: 00000000400003c5 x22: 0000ffffccf3e510 x21: ffff089c36fae148
x20: ffff80000ad30120 x19: ffffa8da8fcce148 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: ffffa8da8e44ea6c
x14: ffffa8da8e44e968 x13: ffffa8da8e03136c x12: 1fffe113804d6809
x11: ffff6113804d6809 x10: 0000000000000a60 x9 : dfff800000000000
x8 : ffff089c026b404f x7 : 00009eec7fb297f7 x6 : 0000000000000001
x5 : ffff80000ad30120 x4 : dfff800000000000 x3 : ffffa8da8e3016f4
x2 : 0000000000000003 x1 : 0000000000000000 x0 : 0000000000000000
Kernel panic - not syncing: kernel stack overflow
CPU: 5 PID: 190 Comm: bash Not tainted 6.2.0-rc2-00320-g1f5abbd77e2c #19
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0xf8/0x108
show_stack+0x20/0x30
dump_stack_lvl+0x68/0x84
dump_stack+0x1c/0x38
panic+0x214/0x404
add_taint+0x0/0xf8
panic_bad_stack+0x144/0x160
handle_bad_stack+0x38/0x58
__bad_stack+0x78/0x7c
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
[...]
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
el1_interrupt+0x28/0x60
el1h_64_irq_handler+0x18/0x28
el1h_64_irq+0x64/0x68
__ftrace_set_clr_event_nolock+0x98/0x198
__ftrace_set_clr_event+0x58/0x80
system_enable_write+0x144/0x178
vfs_write+0x174/0x738
ksys_write+0xd0/0x188
__arm64_sys_write+0x4c/0x60
invoke_syscall+0x64/0x180
el0_svc_common.constprop.0+0x84/0x160
do_el0_svc+0x48/0xe8
el0_svc+0x34/0xd0
el0t_64_sync_handler+0xb8/0xc0
el0t_64_sync+0x190/0x194
SMP: stopping secondary CPUs
Kernel Offset: 0x28da86000000 from 0xffff800008000000
PHYS_OFFSET: 0xfffff76600000000
CPU features: 0x00000,01a00100,0000421b
Memory Limit: none

Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/all/20221119040049.795065-1-zhengyejian1@huawei.com/
Fixes: aaf2bc50df1f ("rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>


# 7ea91307 12-Jan-2023 Zqiang <qiang1.zhang@intel.com>

rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early

According to the commit log of the patch that added it to the kernel,
start_poll_synchronize_rcu_expedited() can be invoked very early, as
in long before rcu_init() has been invoked. But before rcu_init(),
the rcu_data structure's ->mynode field has not yet been initialized.
This means that the start_poll_synchronize_rcu_expedited() function's
attempt to set the CPU's leaf rcu_node structure's ->exp_seq_poll_rq
field will result in a segmentation fault.

This commit therefore causes start_poll_synchronize_rcu_expedited() to
set ->exp_seq_poll_rq only after rcu_init() has initialized all CPUs'
rcu_data structures' ->mynode fields. It also removes the check from
the rcu_init() function so that start_poll_synchronize_rcu_expedited(
is unconditionally invoked. Yes, this might result in an unnecessary
boot-time grace period, but this is down in the noise.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>


# 46103fe0 18-Jan-2023 Zqiang <qiang1.zhang@intel.com>

rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()

The rcu_accelerate_cbs() function is invoked by rcu_report_qs_rdp()
only if there is a grace period in progress that is still blocked
by at least one CPU on this rcu_node structure. This means that
rcu_accelerate_cbs() should never return the value true, and thus that
this function should never set the needwake variable and in turn never
invoke rcu_gp_kthread_wake().

This commit therefore removes the needwake variable and the invocation
of rcu_gp_kthread_wake() in favor of a WARN_ON_ONCE() on the call to
rcu_accelerate_cbs(). The purpose of this new WARN_ON_ONCE() is to
detect situations where the system's opinion differs from ours.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>


# 09853fb8 04-Mar-2023 Paul E. McKenney <paulmck@kernel.org>

rcu: Add comment to rcu_do_batch() identifying rcuoc code path

This commit adds a comment to help explain why the "else" clause of the
in_serving_softirq() "if" statement does not need to enforce a time limit.
The reason is that this "else" clause handles rcuoc kthreads that do not
block handlers for other softirq vectors.

Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>


# 5da7cb19 31-Mar-2023 Ziwei Dai <ziwei.dai@unisoc.com>

rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period

Memory passed to kvfree_rcu() that is to be freed is tracked by a
per-CPU kfree_rcu_cpu structure, which in turn contains pointers
to kvfree_rcu_bulk_data structures that contain pointers to memory
that has not yet been handed to RCU, along with an kfree_rcu_cpu_work
structure that tracks the memory that has already been handed to RCU.
These structures track three categories of memory: (1) Memory for
kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived
during an OOM episode. The first two categories are tracked in a
cache-friendly manner involving a dynamically allocated page of pointers
(the aforementioned kvfree_rcu_bulk_data structures), while the third
uses a simple (but decidedly cache-unfriendly) linked list through the
rcu_head structures in each block of memory.

On a given CPU, these three categories are handled as a unit, with that
CPU's kfree_rcu_cpu_work structure having one pointer for each of the
three categories. Clearly, new memory for a given category cannot be
placed in the corresponding kfree_rcu_cpu_work structure until any old
memory has had its grace period elapse and thus has been removed. And
the kfree_rcu_monitor() function does in fact check for this.

Except that the kfree_rcu_monitor() function checks these pointers one
at a time. This means that if the previous kfree_rcu() memory passed
to RCU had only category 1 and the current one has only category 2, the
kfree_rcu_monitor() function will send that current category-2 memory
along immediately. This can result in memory being freed too soon,
that is, out from under unsuspecting RCU readers.

To see this, consider the following sequence of events, in which:

o Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset",
then is preempted.

o CPU 1 calls kfree_rcu(cset, rcu_head) in order to free "from_cset"
after a later grace period. Except that "from_cset" is freed
right after the previous grace period ended, so that "from_cset"
is immediately freed. Task A resumes and references "from_cset"'s
member, after which nothing good happens.

In full detail:

CPU 0 CPU 1
---------------------- ----------------------
count_memcg_event_mm()
|rcu_read_lock() <---
|mem_cgroup_from_task()
|// css_set_ptr is the "from_cset" mentioned on CPU 1
|css_set_ptr = rcu_dereference((task)->cgroups)
|// Hard irq comes, current task is scheduled out.

cgroup_attach_task()
|cgroup_migrate()
|cgroup_migrate_execute()
|css_set_move_task(task, from_cset, to_cset, true)
|cgroup_move_task(task, to_cset)
|rcu_assign_pointer(.., to_cset)
|...
|cgroup_migrate_finish()
|put_css_set_locked(from_cset)
|from_cset->refcount return 0
|kfree_rcu(cset, rcu_head) // free from_cset after new gp
|add_ptr_to_bulk_krc_lock()
|schedule_delayed_work(&krcp->monitor_work, ..)

kfree_rcu_monitor()
|krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
|queue_rcu_work(system_wq, &krwp->rcu_work)
|if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
|call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp

// There is a perious call_rcu(.., rcu_work_rcufn)
// gp end, rcu_work_rcufn() is called.
rcu_work_rcufn()
|__queue_work(.., rwork->wq, &rwork->work);

|kfree_rcu_work()
|krwp->bulk_head_free[0] bulk is freed before new gp end!!!
|The "from_cset" is freed before new gp end.

// the task resumes some time later.
|css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed.

This commit therefore causes kfree_rcu_monitor() to refrain from moving
kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU
grace period has completed for all three categories.

v2: Use helper function instead of inserted code block at kfree_rcu_monitor().

Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in kfree_rcu()")
Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs")
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# cf7066b9 11-Jan-2023 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Disable laziness if lazy-tracking says so

During suspend, we see failures to suspend 1 in 300-500 suspends.
Looking closer, it appears that asynchronous RCU callbacks are being
queued as lazy even though synchronous callbacks are expedited. These
delays appear to not be very welcome by the suspend/resume code as
evidenced by these occasional suspend failures.

This commit modifies call_rcu() to check if rcu_async_should_hurry(),
which will return true if we are in suspend or in-kernel boot.

[ paulmck: Alphabetize local variables. ]

Ignoring the lazy hint makes the 3000 suspend/resume cycles pass
reliably on a 12th gen 12-core Intel CPU, and there is some evidence
that it also slightly speeds up boot performance.

Fixes: 3cb278e73be5 ("rcu: Make call_rcu() lazy to save power")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 6efdda8b 11-Jan-2023 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Track laziness during boot and suspend

Boot and suspend/resume should not be slowed down in kernels built with
CONFIG_RCU_LAZY=y. In particular, suspend can sometimes fail in such
kernels.

This commit therefore adds rcu_async_hurry(), rcu_async_relax(), and
rcu_async_should_hurry() functions that track whether or not either
a boot or a suspend/resume operation is in progress. This will
enable a later commit to refrain from laziness during those times.

Export rcu_async_should_hurry(), rcu_async_hurry(), and rcu_async_relax()
for later use by rcutorture.

[ paulmck: Apply feedback from Steve Rostedt. ]

Fixes: 3cb278e73be5 ("rcu: Make call_rcu() lazy to save power")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# ccfe1fef 21-Dec-2022 Zqiang <qiang1.zhang@intel.com>

rcu: Remove redundant call to rcu_boost_kthread_setaffinity()

The rcu_boost_kthread_setaffinity() function is invoked at
rcutree_online_cpu() and rcutree_offline_cpu() time, early in the online
timeline and late in the offline timeline, respectively. It is also
invoked from rcutree_dead_cpu(), however, in the absence of userspace
manipulations (for which userspace must take responsibility), this call
is redundant with that from rcutree_offline_cpu(). This redundancy can
be demonstrated by printing out the relevant cpumasks

This commit therefore removes the call to rcu_boost_kthread_setaffinity()
from rcutree_dead_cpu().

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>


# be42f00b 19-Nov-2022 Zhen Lei <thunder.leizhen@huawei.com>

rcu: Add RCU stall diagnosis information

Because RCU CPU stall warnings are driven from the scheduling-clock
interrupt handler, a workload consisting of a very large number of
short-duration hardware interrupts can result in misleading stall-warning
messages. On systems supporting only a single level of interrupts,
that is, where interrupts handlers cannot be interrupted, this can
produce misleading diagnostics. The stack traces will show the
innocent-bystander interrupted task, not the interrupts that are
at the very least exacerbating the stall.

This situation can be improved by displaying the number of interrupts
and the CPU time that they have consumed. Diagnosing other types
of stalls can be eased by also providing the count of softirqs and
the CPU time that they consumed as well as the number of context
switches and the task-level CPU time consumed.

Consider the following output given this change:

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 0-....: (1250 ticks this GP) <omitted>
rcu: hardirqs softirqs csw/system
rcu: number: 624 45 0
rcu: cputime: 69 1 2425 ==> 2500(ms)

This output shows that the number of hard and soft interrupts is small,
there are no context switches, and the system takes up a lot of time. This
indicates that the current task is looping with preemption disabled.

The impact on system performance is negligible because snapshot is
recorded only once for all continuous RCU stalls.

This added debugging information is suppressed by default and can be
enabled by building the kernel with CONFIG_RCU_CPU_STALL_CPUTIME=y or
by booting with rcupdate.rcu_cpu_stall_cputime=1.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 2ca836b1 14-Dec-2022 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/kvfree: Split ready for reclaim objects from a batch

This patch splits the lists of objects so as to avoid sending any
through RCU that have already been queued for more than one grace
period. These long-term-resident objects are immediately freed.
The remaining short-term-resident objects are queued for later freeing
using queue_rcu_work().

This change avoids delaying workqueue handlers with synchronize_rcu()
invocations. Yes, workqueue handlers are designed to handle blocking,
but avoiding blocking when unnecessary improves performance during
low-memory situations.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 4c33464a 14-Dec-2022 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/kvfree: Carefully reset number of objects in krcp

The schedule_delayed_monitor_work() function relies on the count of
objects queued into any given kfree_rcu_cpu structure. This count is
used to determine how quickly to schedule passing these objects to RCU.

There are three pipes where pointers can be placed. When any pipe is
offloaded, the kfree_rcu_cpu structure's ->count counter is set to zero,
which is wrong because the other pipes might still be non-empty.

This commit therefore maintains per-pipe counters, and introduces a
krc_count() helper to access the aggregate value of those counters.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 96274561 02-Dec-2022 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/kvfree: Use READ_ONCE() when access to krcp->head

The need_offload_krc() function is now lock-free, which gives the
compiler freedom to load old values from plain C-language loads from
the kfree_rcu_cpu struture's ->head pointer. This commit therefore
applied READ_ONCE() to these loads.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# cc37d520 29-Nov-2022 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/kvfree: Use a polled API to speedup a reclaim process

Currently all objects placed into a batch wait for a full grace period
to elapse after that batch is ready to send to RCU. However, this
can unnecessarily delay freeing of the first objects that were added
to the batch. After all, several RCU grace periods might have elapsed
since those objects were added, and if so, there is no point in further
deferring their freeing.

This commit therefore adds per-page grace-period snapshots which are
obtained from get_state_synchronize_rcu(). When the batch is ready
to be passed to call_rcu(), each page's snapshot is checked by passing
it to poll_state_synchronize_rcu(). If a given page's RCU grace period
has already elapsed, its objects are freed immediately by kvfree_rcu_bulk().
Otherwise, these objects are freed after a call to synchronize_rcu().

This approach requires that the pages be traversed in reverse order,
that is, the oldest ones first.

Test example:

kvm.sh --memory 10G --torture rcuscale --allcpus --duration 1 \
--kconfig CONFIG_NR_CPUS=64 \
--kconfig CONFIG_RCU_NOCB_CPU=y \
--kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \
--kconfig CONFIG_RCU_LAZY=n \
--bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 \
rcuscale.holdoff=20 rcuscale.kfree_loops=10000 \
torture.disable_onoff_at_boot" --trust-make

Before this commit:

Total time taken by all kfree'ers: 8535693700 ns, loops: 10000, batches: 1188, memory footprint: 2248MB
Total time taken by all kfree'ers: 8466933582 ns, loops: 10000, batches: 1157, memory footprint: 2820MB
Total time taken by all kfree'ers: 5375602446 ns, loops: 10000, batches: 1130, memory footprint: 6502MB
Total time taken by all kfree'ers: 7523283832 ns, loops: 10000, batches: 1006, memory footprint: 3343MB
Total time taken by all kfree'ers: 6459171956 ns, loops: 10000, batches: 1150, memory footprint: 6549MB

After this commit:

Total time taken by all kfree'ers: 8560060176 ns, loops: 10000, batches: 1787, memory footprint: 61MB
Total time taken by all kfree'ers: 8573885501 ns, loops: 10000, batches: 1777, memory footprint: 93MB
Total time taken by all kfree'ers: 8320000202 ns, loops: 10000, batches: 1727, memory footprint: 66MB
Total time taken by all kfree'ers: 8552718794 ns, loops: 10000, batches: 1790, memory footprint: 75MB
Total time taken by all kfree'ers: 8601368792 ns, loops: 10000, batches: 1724, memory footprint: 62MB

The reduction in memory footprint is well in excess of an order of
magnitude.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 8fc5494a 29-Nov-2022 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/kvfree: Move need_offload_krc() out of krcp->lock

The need_offload_krc() function currently holds the krcp->lock in order
to safely check krcp->head. This commit removes the need for this lock
in that function by updating the krcp->head pointer using WRITE_ONCE()
macro so that readers can carry out lockless loads of that pointer.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 8c15a9e8 29-Nov-2022 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/kvfree: Move bulk/list reclaim to separate functions

The kvfree_rcu() code maintains lists of pages of pointers, but also a
singly linked list, with the latter being used when memory allocation
fails. Traversal of these two types of lists is currently open coded.
This commit simplifies the code by providing kvfree_rcu_bulk() and
kvfree_rcu_list() functions, respectively, to traverse these two types
of lists. This patch does not introduce any functional change.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 27538e18 29-Nov-2022 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/kvfree: Switch to a generic linked list API

This commit improves the readability and maintainability of the
kvfree_rcu() code by switching from an open-coded linked list to
the standard Linux-kernel circular doubly linked list. This patch
does not introduce any functional change.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 04a522b7 25-Oct-2022 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu: Refactor kvfree_call_rcu() and high-level helpers

Currently a kvfree_call_rcu() takes an offset within a structure as
a second parameter, so a helper such as a kvfree_rcu_arg_2() has to
convert rcu_head and a freed ptr to an offset in order to pass it. That
leads to an extra conversion on macro entry.

Instead of converting, refactor the code in way that a pointer that has
to be freed is passed directly to the kvfree_call_rcu().

This patch does not make any functional change and is transparent to
all kvfree_rcu() users.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 748bf47a 19-Dec-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Test synchronous RCU grace periods at the end of rcu_init()

This commit tests synchronize_rcu() and synchronize_rcu_expedited()
at the end of rcu_init(), in addition to the test already at the
beginning of that function. These tests are run only in kernels built
with CONFIG_PROVE_RCU=y.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 3d1adf7a 14-Dec-2022 Zqiang <qiang1.zhang@intel.com>

rcu: Make rcu_blocking_is_gp() stop early-boot might_sleep()

Currently, rcu_blocking_is_gp() invokes might_sleep() even during early
boot when interrupts are disabled and before the scheduler is scheduling.
This is at best an accident waiting to happen. Therefore, this commit
moves that might_sleep() under an rcu_scheduler_active check in order
to ensure that might_sleep() is not invoked unless sleeping might actually
happen.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 95ff24ee 25-Nov-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Upgrade header comment for poll_state_synchronize_rcu()

This commit emphasizes the possibility of concurrent calls to
synchronize_rcu() and synchronize_rcu_expedited() causing one or
the other of the two grace periods being lost from the viewpoint of
poll_state_synchronize_rcu().

If you cannot afford to lose grace periods this way, you should
instead use the _full() variants of the polled RCU API, for
example, poll_state_synchronize_rcu_full().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 253cbbff 14-Nov-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Throttle callback invocation based on number of ready callbacks

Currently, rcu_do_batch() sizes its batches based on the total number
of callbacks in the callback list. This can result in some strange
choices, for example, if there was 12,800 callbacks in the list, but
only 200 were ready to invoke, RCU would invoke 100 at a time (12,800
shifted down by seven bits).

A more measured approach would use the number that were actually ready
to invoke, an approach that has become feasible only recently given the
per-segment ->seglen counts in ->cblist.

This commit therefore bases the batch limit on the number of callbacks
ready to invoke instead of on the total number of callbacks.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 5a04848d 06-Nov-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Consolidate initialization and CPU-hotplug code

This commit consolidates the initialization and CPU-hotplug code at
the end of kernel/rcu/tree.c. This is strictly a code-motion commit.
No functionality has changed.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 3f6c3d29 15-Dec-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Don't assert interrupts enabled too early in boot

The rcu_poll_gp_seq_end() and rcu_poll_gp_seq_end_unlocked() both check
that interrupts are enabled, as they normally should be when waiting for
an RCU grace period. Except that it is legal to wait for grace periods
during early boot, before interrupts have been enabled for the first time,
and polling for grace periods is required to work during this time.
This can result in false-positive lockdep splats in the presence of
boot-time-initiated tracing.

This commit therefore conditions those interrupts-enabled checks on
rcu_scheduler_active having advanced past RCU_SCHEDULER_INACTIVE, by
which time interrupts have been enabled.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# 3cb278e7 16-Oct-2022 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Make call_rcu() lazy to save power

Implement timer-based RCU callback batching (also known as lazy
callbacks). With this we save about 5-10% of power consumed due
to RCU requests that happen when system is lightly loaded or idle.

By default, all async callbacks (queued via call_rcu) are marked
lazy. An alternate API call_rcu_hurry() is provided for the few users,
for example synchronize_rcu(), that need the old behavior.

The batch is flushed whenever a certain amount of time has passed, or
the batch on a particular CPU grows too big. Also memory pressure will
flush it in a future patch.

To handle several corner cases automagically (such as rcu_barrier() and
hotplug), we re-use bypass lists which were originally introduced to
address lock contention, to handle lazy CBs as well. The bypass list
length has the lazy CB length included in it. A separate lazy CB length
counter is also introduced to keep track of the number of lazy CBs.

[ paulmck: Fix formatting of inline call_rcu_lazy() definition. ]
[ paulmck: Apply Zqiang feedback. ]
[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]

Suggested-by: Paul McKenney <paulmck@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# ceb1c8c9 12-Oct-2022 Zqiang <qiang1.zhang@intel.com>

rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()

Running rcutorture with non-zero fqs_duration module parameter in a
kernel built with CONFIG_PREEMPTION=y results in the following splat:

BUG: using __this_cpu_read() in preemptible [00000000]
code: rcu_torture_fqs/398
caller is __this_cpu_preempt_check+0x13/0x20
CPU: 3 PID: 398 Comm: rcu_torture_fqs Not tainted 6.0.0-rc1-yoctodev-standard+
Call Trace:
<TASK>
dump_stack_lvl+0x5b/0x86
dump_stack+0x10/0x16
check_preemption_disabled+0xe5/0xf0
__this_cpu_preempt_check+0x13/0x20
rcu_force_quiescent_state.part.0+0x1c/0x170
rcu_force_quiescent_state+0x1e/0x30
rcu_torture_fqs+0xca/0x160
? rcu_torture_boost+0x430/0x430
kthread+0x192/0x1d0
? kthread_complete_and_exit+0x30/0x30
ret_from_fork+0x22/0x30
</TASK>

The problem is that rcu_force_quiescent_state() uses __this_cpu_read()
in preemptible code instead of the proper raw_cpu_read(). This commit
therefore changes __this_cpu_read() to raw_cpu_read().

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# fdbdb868 25-Sep-2022 Yipeng Zou <zouyipeng@huawei.com>

rcu: Remove rcu_is_idle_cpu()

The commit 3fcd6a230fa7 ("x86/cpu: Avoid cpuinfo-induced IPIing of
idle CPUs") introduced rcu_is_idle_cpu() in order to identify the
current CPU idle state. But commit f3eca381bd49 ("x86/aperfmperf:
Replace arch_freq_get_on_cpu()") switched to using MAX_SAMPLE_AGE,
so rcu_is_idle_cpu() is no longer used. This commit therefore removes it.

Fixes: f3eca381bd49 ("x86/aperfmperf: Replace arch_freq_get_on_cpu()")
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# b8f7aca3 16-Oct-2022 Frederic Weisbecker <frederic@kernel.org>

rcu: Fix missing nocb gp wake on rcu_barrier()

In preparation for RCU lazy changes, wake up the RCU nocb gp thread if
needed after an entrain. This change prevents the RCU barrier callback
from waiting in the queue for several seconds before the lazy callbacks
in front of it are serviced.

Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# aba9645b 17-Sep-2022 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Use READ_ONCE() for lockless read of rnp->qsmask

The rnp->qsmask is locklessly accessed from rcutree_dying_cpu(). This
may help avoid load tearing due to concurrent access, KCSAN
issues, and preserve sanity of people reading the mask in tracing.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# d6fd907a 30-Aug-2022 Zqiang <qiang1.zhang@intel.com>

rcu: Remove duplicate RCU exp QS report from rcu_report_dead()

The rcu_report_dead() function invokes rcu_report_exp_rdp() in order
to force an immediate expedited quiescent state on the outgoing
CPU, and then it invokes rcu_preempt_deferred_qs() to provide any
required deferred quiescent state of either sort. Because the call to
rcu_preempt_deferred_qs() provides the expedited RCU quiescent state if
requested, the call to rcu_report_exp_rdp() is potentially redundant.

One possible issue is a concurrent start of a new expedited RCU
grace period, but this situation is already handled correctly
by __sync_rcu_exp_select_node_cpus(). This function will detect
that the CPU is going offline via the error return from its call
to smp_call_function_single(). In that case, it will retry, and
eventually stop retrying due to rcu_report_exp_rdp() clearing the
->qsmaskinitnext bit corresponding to the target CPU. As a result,
__sync_rcu_exp_select_node_cpus() will report the necessary quiescent
state after dealing with any remaining CPU.

This change assumes that control does not enter rcu_report_dead() within
an RCU read-side critical section, but then again, the surviving call
to rcu_preempt_deferred_qs() has always made this assumption.

This commit therefore removes the call to rcu_report_exp_rdp(), thus
relying on rcu_preempt_deferred_qs() to handle both normal and expedited
quiescent states.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 31d8aaa8 20-Oct-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Keep synchronize_rcu() from enabling irqs in early boot

Making polled RCU grace periods account for expedited grace periods
required acquiring the leaf rcu_node structure's lock during early boot,
but after rcu_init() was called. This lock is irq-disabled, but the
code incorrectly assumes that irqs are always disabled when invoking
synchronize_rcu(). The exception is early boot before the scheduler has
started, which means that upon return from synchronize_rcu(), irqs will
be incorrectly enabled.

This commit fixes this bug by using irqsave/irqrestore locking primitives.

Fixes: bf95b2bc3e42 ("rcu: Switch polled grace-period APIs to ->gp_seq_polled")

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 528262f5 18-Jul-2022 Zqiang <qiang1.zhang@intel.com>

rcu-tasks: Make RCU Tasks Trace check for userspace execution

Userspace execution is a valid quiescent state for RCU Tasks Trace,
but the scheduling-clock interrupt does not currently report such
quiescent states.

Of course, the scheduling-clock interrupt is not strictly speaking
userspace execution. However, the only way that this code is not
in a quiescent state is if something invoked rcu_read_lock_trace(),
and that would be reflected in the ->trc_reader_nesting field in
the task_struct structure. Furthermore, this field is checked by
rcu_tasks_trace_qs(), which is invoked by rcu_tasks_qs() which is in
turn invoked by rcu_note_voluntary_context_switch() in kernels building
at least one of the RCU Tasks flavors. It is therefore safe to invoke
rcu_tasks_trace_qs() from the rcu_sched_clock_irq().

But rcu_tasks_qs() also invokes rcu_tasks_classic_qs() for RCU
Tasks, which lacks the read-side markers provided by RCU Tasks Trace.
This raises the possibility that an RCU Tasks grace period could start
after the interrupt from userspace execution, but before the call to
rcu_sched_clock_irq(). However, it turns out that this is safe because
the RCU Tasks grace period waits for an RCU grace period, which will
wait for the entire scheduling-clock interrupt handler, including any
RCU Tasks read-side critical section that this handler might contain.

This commit therefore updates the rcu_sched_clock_irq() function's
check for usermode execution and its call to rcu_tasks_classic_qs()
to instead check for both usermode execution and interrupt from idle,
and to instead call rcu_note_voluntary_context_switch(). This
consolidates code and provides more faster RCU Tasks Trace
reporting of quiescent states in kernels that do scheduling-clock
interrupts for userspace execution.

[ paulmck: Consolidate checks into rcu_sched_clock_irq(). ]

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# d761de8a 05-Aug-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Make synchronize_rcu() fastpath update only boot-CPU counters

Large systems can have hundreds of rcu_node structures, and updating
counters in each of them might slow down booting. This commit therefore
updates only the counters in those rcu_node structures corresponding
to the boot CPU, up to and including the root rcu_node structure.

The counters for the remaining rcu_node structures are updated by the
rcu_scheduler_starting() function, which executes just before the first
non-boot kthread is spawned.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 7ecef087 04-Aug-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove ->rgos_polled field from rcu_gp_oldstate structure

Because both normal and expedited grace periods increment their respective
counters on their pre-scheduler early boot fastpaths, the rcu_gp_oldstate
structure no longer needs its ->rgos_polled field. This commit therefore
removes this field, shrinking this structure so that it is the same size
as an rcu_head structure.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 910e1209 04-Aug-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Make synchronize_rcu() fast path update ->gp_seq counters

This commit causes the early boot single-CPU synchronize_rcu() fastpath to
update the rcu_state and rcu_node structures' ->gp_seq and ->gp_seq_needed
counters. This will allow the full-state polled grace-period APIs to
detect all normal grace periods without the need to track the special
combined polling-only counter, which is a step towards removing the
->rgos_polled field from the rcu_gp_oldstate, thereby reducing its size
by one third.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 5f11bad6 04-Aug-2022 Paul E. McKenney <paulmck@kernel.org>

rcu-tasks: Remove grace-period fast-path rcu-tasks helper

Now that the grace-period fast path can only happen during the
pre-scheduler portion of early boot, this fast path can no longer block
run-time RCU Tasks and RCU Tasks Trace grace periods. This commit
therefore removes the conditional cond_resched_tasks_rcu_qs() invocation.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a5d1b0b6 04-Aug-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Set rcu_data structures' initial ->gpwrap value to true

It would be good do reduce the size of the rcu_gp_oldstate structure
from three unsigned long instances to two, but this requires that the
boot-time optimized grace periods update the various ->gp_seq fields.
Updating these fields in the rcu_state structure and in all of the
rcu_node structures is at least semi-reasonable, but updating them in
all of the rcu_data structures is a bridge too far. This means that if
there are too many early boot-time grace periods, the ->gp_seq field in
the rcu_data structure cannot be trusted. This commit therefore sets
each rcu_data structure's ->gpwrap field to provide the necessary impetus
for a suitable level of distrust.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 258f887a 04-Aug-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Disable run-time single-CPU grace-period optimization

The run-time single-CPU grace-period optimization applies only to
kernels built with CONFIG_SMP=y && CONFIG_PREEMPTION=y that are running
on a single-CPU system. But a kernel intended for a single-CPU system
should instead be built with CONFIG_SMP=n, and in any case, single-CPU
systems running Linux no longer appear to be the common case. Plus this
optimization results in the rcu_gp_oldstate structure being half again
larger than it needs to be.

This commit therefore disables the run-time single-CPU grace-period
optimization, so that this optimization applies only during the
pre-scheduler portion of the boot sequence.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# b6fe4917 04-Aug-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Add full-sized polling for cond_sync_full()

The cond_synchronize_rcu() API compresses the combined expedited and
normal grace-period states into a single unsigned long, which conserves
storage, but can miss grace periods in certain cases involving overlapping
normal and expedited grace periods. Missing the occasional grace period
is usually not a problem, but there are use cases that care about each
and every grace period.

This commit therefore adds yet another member of the full-state RCU
grace-period polling API, which is the cond_synchronize_rcu_full()
function. This uses up to three times the storage (rcu_gp_oldstate
structure instead of unsigned long), but is guaranteed not to miss
grace periods.

[ paulmck: Apply feedback from kernel test robot and Julia Lawall. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# f21e0143 03-Aug-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove blank line from poll_state_synchronize_rcu() docbook header

This commit removes the blank line preceding the oldstate parameter to
the docbook header for the poll_state_synchronize_rcu() function and
marks uses of this parameter later in that header.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 76ea3641 02-Aug-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Add full-sized polling for start_poll()

The start_poll_synchronize_rcu() API compresses the combined expedited and
normal grace-period states into a single unsigned long, which conserves
storage, but can miss grace periods in certain cases involving overlapping
normal and expedited grace periods. Missing the occasional grace period
is usually not a problem, but there are use cases that care about each
and every grace period.

This commit therefore adds the next member of the full-state RCU
grace-period polling API, namely the start_poll_synchronize_rcu_full()
function. This uses up to three times the storage (rcu_gp_oldstate
structure instead of unsigned long), but is guaranteed not to miss
grace periods.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 3fdefca9 28-Jul-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Add full-sized polling for get_state()

The get_state_synchronize_rcu() API compresses the combined expedited and
normal grace-period states into a single unsigned long, which conserves
storage, but can miss grace periods in certain cases involving overlapping
normal and expedited grace periods. Missing the occasional grace period
is usually not a problem, but there are use cases that care about each
and every grace period.

This commit therefore adds the next member of the full-state RCU
grace-period polling API, namely the get_state_synchronize_rcu_full()
function. This uses up to three times the storage (rcu_gp_oldstate
structure instead of unsigned long), but is guaranteed not to miss
grace periods.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 91a967fd 28-Jul-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Add full-sized polling for get_completed*() and poll_state*()

The get_completed_synchronize_rcu() and poll_state_synchronize_rcu()
APIs compress the combined expedited and normal grace-period states into a
single unsigned long, which conserves storage, but can miss grace periods
in certain cases involving overlapping normal and expedited grace periods.
Missing the occasional grace period is usually not a problem, but there
are use cases that care about each and every grace period.

This commit therefore adds the first members of the full-state RCU
grace-period polling API, namely the get_completed_synchronize_rcu_full()
and poll_state_synchronize_rcu_full() functions. These use up to three
times the storage (rcu_gp_oldstate structure instead of unsigned long),
but which are guaranteed not to miss grace periods, at least in situations
where the single-CPU grace-period optimization does not apply.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 51824b78 30-Jun-2022 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval

Currently the monitor work is scheduled with a fixed interval of HZ/20,
which is roughly 50 milliseconds. The drawback of this approach is
low utilization of the 512 page slots in scenarios with infrequence
kvfree_rcu() calls. For example on an Android system:

<snip>
kworker/3:3-507 [003] .... 470.286305: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d0f0dde5 nr_records=6
kworker/6:1-76 [006] .... 470.416613: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000ea0d6556 nr_records=1
kworker/6:1-76 [006] .... 470.416625: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000003e025849 nr_records=9
kworker/3:3-507 [003] .... 471.390000: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000815a8713 nr_records=48
kworker/1:1-73 [001] .... 471.725785: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000fda9bf20 nr_records=3
kworker/1:1-73 [001] .... 471.725833: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000a425b67b nr_records=76
kworker/0:4-1411 [000] .... 472.085673: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000007996be9d nr_records=1
kworker/0:4-1411 [000] .... 472.085728: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d0f0dde5 nr_records=5
kworker/6:1-76 [006] .... 472.260340: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000065630ee4 nr_records=102
<snip>

In many cases, out of 512 slots, fewer than 10 were actually used.
In order to improve batching and make utilization more efficient this
commit sets a drain interval to a fixed 5-seconds interval. Floods are
detected when a page fills quickly, and in that case, the reclaim work
is re-scheduled for the next scheduling-clock tick (jiffy).

After this change:

<snip>
kworker/7:1-371 [007] .... 5630.725708: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000005ab0ffb3 nr_records=121
kworker/7:1-371 [007] .... 5630.989702: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000060c84761 nr_records=47
kworker/7:1-371 [007] .... 5630.989714: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000000babf308 nr_records=510
kworker/7:1-371 [007] .... 5631.553790: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000bb7bd0ef nr_records=169
kworker/7:1-371 [007] .... 5631.553808: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000044c78753 nr_records=510
kworker/5:6-9428 [005] .... 5631.746102: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d98519aa nr_records=123
kworker/4:7-9434 [004] .... 5632.001758: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000526c9d44 nr_records=322
kworker/4:7-9434 [004] .... 5632.002073: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000002c6a8afa nr_records=185
kworker/7:1-371 [007] .... 5632.277515: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000007f4a962f nr_records=510
<snip>

Here, all but one of the cases, more than one hundreds slots were used,
representing an order-of-magnitude improvement.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 38269096 22-Jun-2022 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/kfree: Fix kfree_rcu_shrink_count() return value

As per the comments in include/linux/shrinker.h, .count_objects callback
should return the number of freeable items, but if there are no objects
to free, SHRINK_EMPTY should be returned. The only time 0 is returned
should be when we are unable to determine the number of objects, or the
cache should be skipped for another reason.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 093590c1 22-Jun-2022 Michal Hocko <mhocko@suse.com>

rcu: Back off upon fill_page_cache_func() allocation failure

The fill_page_cache_func() function allocates couple of pages to store
kvfree_rcu_bulk_data structures. This is a lightweight (GFP_NORETRY)
allocation which can fail under memory pressure. The function will,
however keep retrying even when the previous attempt has failed.

This retrying is in theory correct, but in practice the allocation is
invoked from workqueue context, which means that if the memory reclaim
gets stuck, these retries can hog the worker for quite some time.
Although the workqueues subsystem automatically adjusts concurrency, such
adjustment is not guaranteed to happen until the worker context sleeps.
And the fill_page_cache_func() function's retry loop is not guaranteed
to sleep (see the should_reclaim_retry() function).

And we have seen this function cause workqueue lockups:

kernel: BUG: workqueue lockup - pool cpus=93 node=1 flags=0x1 nice=0 stuck for 32s!
[...]
kernel: pool 74: cpus=37 node=0 flags=0x1 nice=0 hung=32s workers=2 manager: 2146
kernel: pwq 498: cpus=249 node=1 flags=0x1 nice=0 active=4/256 refcnt=5
kernel: in-flight: 1917:fill_page_cache_func
kernel: pending: dbs_work_handler, free_work, kfree_rcu_monitor

Originally, we thought that the root cause of this lockup was several
retries with direct reclaim, but this is not yet confirmed. Furthermore,
we have seen similar lockups without any heavy memory pressure. This
suggests that there are other factors contributing to these lockups.
However, it is not really clear that endless retries are desireable.

So let's make the fill_page_cache_func() function back off after
allocation failure.

Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# e33c267a 31-May-2022 Roman Gushchin <roman.gushchin@linux.dev>

mm: shrinkers: provide shrinkers with names

Currently shrinkers are anonymous objects. For debugging purposes they
can be identified by count/scan function names, but it's not always
useful: e.g. for superblock's shrinkers it's nice to have at least an
idea of to which superblock the shrinker belongs.

This commit adds names to shrinkers. register_shrinker() and
prealloc_shrinker() functions are extended to take a format and arguments
to master a name.

In some cases it's not possible to determine a good name at the time when
a shrinker is allocated. For such cases shrinker_debugfs_rename() is
provided.

The expected format is:
<subsystem>-<shrinker_type>[:<instance>]-<id>
For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.

After this change the shrinker debugfs directory looks like:
$ cd /sys/kernel/debug/shrinker/
$ ls
dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42
mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43
mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44
rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49
sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13
sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36
sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19
sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10
sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9
sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37
sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38
sb-dax-11 sb-proc-45 sb-tmpfs-35
sb-debugfs-7 sb-proc-46 sb-tmpfs-40

[roman.gushchin@linux.dev: fix build warnings]
Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# d96c52fe 15-Apr-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Add polled expedited grace-period primitives

This commit adds expedited grace-period functionality to RCU's polled
grace-period API, adding start_poll_synchronize_rcu_expedited() and
cond_synchronize_rcu_expedited(), which are similar to the existing
start_poll_synchronize_rcu() and cond_synchronize_rcu() functions,
respectively.

Note that although start_poll_synchronize_rcu_expedited() can be invoked
very early, the resulting expedited grace periods are not guaranteed
to start until after workqueues are fully initialized. On the other
hand, both synchronize_rcu() and synchronize_rcu_expedited() can also
be invoked very early, and the resulting grace periods will be taken
into account as they occur.

[ paulmck: Apply feedback from Neeraj Upadhyay. ]

Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# dd041405 14-Apr-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Make polled grace-period API account for expedited grace periods

Currently, this code could splat:

oldstate = get_state_synchronize_rcu();
synchronize_rcu_expedited();
WARN_ON_ONCE(!poll_state_synchronize_rcu(oldstate));

This situation is counter-intuitive and user-unfriendly. After all, there
really was a perfectly valid full grace period right after the call to
get_state_synchronize_rcu(), so why shouldn't poll_state_synchronize_rcu()
know about it?

This commit therefore makes the polled grace-period API aware of expedited
grace periods in addition to the normal grace periods that it is already
aware of. With this change, the above code is guaranteed not to splat.

Please note that the above code can still splat due to counter wrap on the
one hand and situations involving partially overlapping normal/expedited
grace periods on the other. On 64-bit systems, the second is of course
much more likely than the first. It is possible to modify this approach
to prevent overlapping grace periods from causing splats, but only at
the expense of greatly increasing the probability of counter wrap, as
in within milliseconds on 32-bit systems and within minutes on 64-bit
systems.

This commit is in preparation for polled expedited grace periods.

Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# bf95b2bc 13-Apr-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Switch polled grace-period APIs to ->gp_seq_polled

This commit switches the existing polled grace-period APIs to use a
new ->gp_seq_polled counter in the rcu_state structure. An additional
->gp_seq_polled_snap counter in that same structure allows the normal
grace period kthread to interact properly with the !SMP !PREEMPT fastpath
through synchronize_rcu(). The first of the two to note the end of a
given grace period will make knowledge of this transition available to
the polled API.

This commit is in preparation for polled expedited grace periods.

[ paulmck: Fix use of rcu_state.gp_seq_polled to start normal grace period. ]

Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Co-developed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 8f489b4d 11-May-2022 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/nocb: Add option to opt rcuo kthreads out of RT priority

This commit introduces a RCU_NOCB_CPU_CB_BOOST Kconfig option that
prevents rcuo kthreads from running at real-time priority, even in
kernels built with RCU_BOOST. This capability is important to devices
needing low-latency (as in a few milliseconds) response from expedited
RCU grace periods, but which are not running a classic real-time workload.
On such devices, permitting the rcuo kthreads to run at real-time priority
results in unacceptable latencies imposed on the application tasks,
which run as SCHED_OTHER.

See for example the following trace output:

<snip>
<...>-60 [006] d..1 2979.028717: rcu_batch_start: rcu_preempt CBs=34619 bl=270
<snip>

If that rcuop kthread were permitted to run at real-time SCHED_FIFO
priority, it would monopolize its CPU for hundreds of milliseconds
while invoking those 34619 RCU callback functions, which would cause an
unacceptably long latency spike for many application stacks on Android
platforms.

However, some existing real-time workloads require that callback
invocation run at SCHED_FIFO priority, for example, those running on
systems with heavy SCHED_OTHER background loads. (It is the real-time
system's administrator's responsibility to make sure that important
real-time tasks run at a higher priority than do RCU's kthreads.)

Therefore, this new RCU_NOCB_CPU_CB_BOOST Kconfig option defaults to
"y" on kernels built with PREEMPT_RT and defaults to "n" otherwise.
The effect is to preserve current behavior for real-time systems, but for
other systems to allow expedited RCU grace periods to run with real-time
priority while continuing to invoke RCU callbacks as SCHED_OTHER.

As you would expect, this RCU_NOCB_CPU_CB_BOOST Kconfig option has no
effect except on CPUs with offloaded RCU callbacks.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>


# 51038506 29-Apr-2022 Zqiang <qiang1.zhang@intel.com>

rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread()

Callbacks are invoked in RCU kthreads when calbacks are offloaded
(rcu_nocbs boot parameter) or when RCU's softirq handler has been
offloaded to rcuc kthreads (use_softirq==0). The current code allows
for the rcu_nocbs case but not the use_softirq case. This commit adds
support for the use_softirq case.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>


# a03ae49c 08-Jun-2022 Neeraj Upadhyay <quic_neeraju@quicinc.com>

rcu/tree: Add comment to describe GP-done condition in fqs loop

Add a comment to explain why !rcu_preempt_blocked_readers_cgp() condition
is required on root rnp node, for GP completion check in rcu_gp_fqs_loop().

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 9bdb5b3a 08-Jun-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs()

This commit saves a line of code by initializing the rcu_gp_fqs()
function's first_gp_fqs local variable in its declaration.

Reported-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 82d26c36 02-Jun-2022 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/kvfree: Remove useless monitor_todo flag

monitor_todo is not needed as the work struct already tracks
if work is pending. Just use that to know if work is pending
using schedule_delayed_work() helper.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>


# e2bb1288 25-May-2022 Zqiang <qiang1.zhang@intel.com>

rcu: Cleanup RCU urgency state for offline CPU

When a CPU is slow to provide a quiescent state for a given grace
period, RCU takes steps to encourage that CPU to get with the
quiescent-state program in a more timely fashion. These steps
include these flags in the rcu_data structure:

1. ->rcu_urgent_qs, which causes the scheduling-clock interrupt to
request an otherwise pointless context switch from the scheduler.

2. ->rcu_need_heavy_qs, which causes both cond_resched() and RCU's
context-switch hook to do an immediate momentary quiscent state.

3. ->rcu_need_heavy_qs, which causes the scheduler-clock tick to
be enabled even on nohz_full CPUs with only one runnable task.

These flags are of course cleared once the corresponding CPU has passed
through a quiescent state. Unless that quiescent state is the CPU
going offline, which means that when the CPU comes back online, it will
needlessly consume additional CPU time and incur additional latency,
which constitutes a minor but very real performance bug.

This commit therefore adds the call to rcu_disable_urgency_upon_qs()
that clears these flags to the CPU-hotplug offlining code path.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>


# 52c1d81e 05-May-2022 Zqiang <qiang1.zhang@intel.com>

rcu: Add rnp->cbovldmask check in rcutree_migrate_callbacks()

Currently, the rcu_node structure's ->cbovlmask field is set in call_rcu()
when a given CPU is suffering from callback overload. But if that CPU
goes offline, the outgoing CPU's callbacks is migrated to the running
CPU, which is likely to overload the running CPU. However, that CPU's
bit in its leaf rcu_node structure's ->cbovlmask field remains zero.

Initially, this is OK because the outgoing CPU's bit remains set.
However, that bit will be cleared at the next end of a grace period,
at which time it is quite possible that the running CPU will still
be overloaded. If the running CPU invokes call_rcu(), then overload
will be checked for and the bit will be set. Except that there is no
guarantee that the running CPU will invoke call_rcu(), in which case the
next grace period will fail to take the running CPU's overload condition
into account. Plus, because the bit is not set, the end of the grace
period won't check for overload on this CPU.

This commit therefore adds a call to check_cb_ovld_locked() in
rcutree_migrate_callbacks() to set the running CPU's ->cbovlmask bit
appropriately.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>


# fb77dccf 12-Apr-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Decrease FQS scan wait time in case of callback overloading

The force-quiesce-state loop function rcu_gp_fqs_loop() checks for
callback overloading and does an immediate initial scan for idle CPUs
if so. However, subsequent rescans will be carried out at as leisurely a
rate as they always are, as specified by the rcutree.jiffies_till_next_fqs
module parameter. It might be tempting to just continue immediately
rescanning, but this turns the RCU grace-period kthread into a CPU hog.
It might also be tempting to reduce the time between rescans to a single
jiffy, but this can be problematic on larger systems.

This commit therefore divides the normal time between rescans by three,
rounding up. Thus a small system running at HZ=1000 that is suffering
from callback overload will wait only one jiffy instead of the normal
three between rescans.

[ paulmck: Apply Neeraj Upadhyay feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>


# 17147677 08-Jun-2022 Frederic Weisbecker <frederic@kernel.org>

context_tracking: Convert state to atomic_t

Context tracking's state and dynticks counter are going to be merged
in a single field so that both updates can happen atomically and at the
same time. Prepare for that with converting the state into an atomic_t.

[ paulmck: Apply kernel test robot feedback. ]

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>


# 17211455 08-Jun-2022 Frederic Weisbecker <frederic@kernel.org>

rcu/context-tracking: Move RCU-dynticks internal functions to context_tracking

Move the core RCU eqs/dynticks functions to context tracking so that
we can later merge all that code within context tracking.

Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>


# 56450649 08-Jun-2022 Frederic Weisbecker <frederic@kernel.org>

rcu/context-tracking: Move deferred nocb resched to context tracking

To prepare for migrating the RCU eqs accounting code to context tracking,
split the last-resort deferred nocb resched from rcu_user_enter() and
move it into a separate call from context tracking.

Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>


# 95e04f48 08-Jun-2022 Frederic Weisbecker <frederic@kernel.org>

rcu/context_tracking: Move dynticks_nmi_nesting to context tracking

The RCU eqs tracking is going to be performed by the context tracking
subsystem. The related nesting counters thus need to be moved to the
context tracking structure.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>


# 904e600e 08-Jun-2022 Frederic Weisbecker <frederic@kernel.org>

rcu/context_tracking: Move dynticks_nesting to context tracking

The RCU eqs tracking is going to be performed by the context tracking
subsystem. The related nesting counters thus need to be moved to the
context tracking structure.

Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>


# 62e2412d 08-Jun-2022 Frederic Weisbecker <frederic@kernel.org>

rcu/context_tracking: Move dynticks counter to context tracking

In order to prepare for merging RCU dynticks counter into the context
tracking state, move the rcu_data's dynticks field to the context
tracking structure. It will later be mixed within the context tracking
state itself.

[ paulmck: Move enum ctx_state into global scope. ]

Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>


# 3864caaf 08-Jun-2022 Frederic Weisbecker <frederic@kernel.org>

rcu/context-tracking: Remove rcu_irq_enter/exit()

Now rcu_irq_enter/exit() is an unnecessary middle call between
ct_irq_enter/exit() and nmi_irq_enter/exit(). Take this opportunity
to remove the former functions and move the comments above them to the
new entrypoints.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>


# e67198cc 08-Jun-2022 Frederic Weisbecker <frederic@kernel.org>

context_tracking: Take idle eqs entrypoints over RCU

The RCU dynticks counter is going to be merged into the context tracking
subsystem. Start with moving the idle extended quiescent states
entrypoints to context tracking. For now those are dumb redirections to
existing RCU calls.

[ paulmck: Apply kernel test robot feedback. ]

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>


# ed4ae5ef 17-May-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Apply noinstr to rcu_idle_enter() and rcu_idle_exit()

This commit applies the "noinstr" tag to the rcu_idle_enter() and
rcu_idle_exit() functions, which are invoked from portions of the idle
loop that cannot be instrumented. These tags require reworking the
rcu_eqs_enter() and rcu_eqs_exit() functions that these two functions
invoke in order to cause them to use normal assertions rather than
lockdep. In addition, within rcu_idle_exit(), the raw versions of
local_irq_save() and local_irq_restore() are used, again to avoid issues
with lockdep in uninstrumented code.

This patch is based in part on an earlier patch by Jiri Olsa, discussions
with Peter Zijlstra and Frederic Weisbecker, earlier changes by Thomas
Gleixner, and off-list discussions with Yonghong Song.

Link: https://lore.kernel.org/lkml/20220515203653.4039075-1-jolsa@kernel.org/
Reported-by: Jiri Olsa <jolsa@kernel.org>
Reported-by: Alexei Starovoitov <ast@kernel.org>
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Yonghong Song <yhs@fb.com>


# 414c1238 13-Apr-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Provide a get_completed_synchronize_rcu() function

It is currently up to the caller to handle stale return values from
get_state_synchronize_rcu(). If poll_state_synchronize_rcu() returned
true once, a grace period has elapsed, regardless of the fact that counter
wrap might cause some future poll_state_synchronize_rcu() invocation to
return false. For example, the caller might store a separate flag that
indicates whether some previous call to poll_state_synchronize_rcu()
determined that the relevant grace period had already ended.

This approach works, but it requires extra storage and is easy to get
wrong. This commit therefore introduces a get_completed_synchronize_rcu()
that returns a cookie that causes poll_state_synchronize_rcu() to always
return true. This already-completed cookie can be stored in place of the
cookie that previously caused poll_state_synchronize_rcu() to return true.
It can also be used to flag a given structure as not having been exposed
to readers, and thus not requiring a grace period to elapse.

This commit is in preparation for polled expedited grace periods.

Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 2403e804 21-Mar-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Make normal polling GP be more precise about sequence numbers

Currently, poll_state_synchronize_rcu() uses rcu_seq_done() to check
whether the specified grace period has completed. However, rcu_seq_done()
does a simple comparison that reserves have of the sequence-number space
for uncompleted grace periods. This has the unfortunate side-effect
of not handling sequence-number wrap gracefully. Of course, one can
argue that if someone has already waited for half of the full range of
grace periods, they can wait for the other half, but why wait at all in
this case?

This commit therefore creates a rcu_seq_done_exact() that counts as
uncompleted only the two grace periods during which the sequence number
might have been handed out, while still being uncompleted. This way,
if sequence-number wrap happens to hit that range, at most two additional
grace periods need be waited for.

This commit is in preparation for polled expedited grace periods.

Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 9621fbee 08-Apr-2022 Kalesh Singh <kaleshsingh@google.com>

rcu: Move expedited grace period (GP) work to RT kthread_worker

Enabling CONFIG_RCU_BOOST did not reduce RCU expedited grace-period
latency because its workqueues run at SCHED_OTHER, and thus can be
delayed by normal processes. This commit avoids these delays by moving
the expedited GP work items to a real-time-priority kthread_worker.

This option is controlled by CONFIG_RCU_EXP_KTHREAD and disabled by
default on PREEMPT_RT=y kernels which disable expedited grace periods
after boot by unconditionally setting rcupdate.rcu_normal_after_boot=1.

The results were evaluated on arm64 Android devices (6GB ram) running
5.10 kernel, and capturing trace data in critical user-level code.

The table below shows the resulting order-of-magnitude improvements
in synchronize_rcu_expedited() latency:

------------------------------------------------------------------------
| | workqueues | kthread_worker | Diff |
------------------------------------------------------------------------
| Count | 725 | 688 | |
------------------------------------------------------------------------
| Min Duration (ns) | 326 | 447 | 37.12% |
------------------------------------------------------------------------
| Q1 (ns) | 39,428 | 38,971 | -1.16% |
------------------------------------------------------------------------
| Q2 - Median (ns) | 98,225 | 69,743 | -29.00% |
------------------------------------------------------------------------
| Q3 (ns) | 342,122 | 126,638 | -62.98% |
------------------------------------------------------------------------
| Max Duration (ns) | 372,766,967 | 2,329,671 | -99.38% |
------------------------------------------------------------------------
| Avg Duration (ns) | 2,746,353 | 151,242 | -94.49% |
------------------------------------------------------------------------
| Standard Deviation (ns) | 19,327,765 | 294,408 | |
------------------------------------------------------------------------

The below table show the range of maximums/minimums for
synchronize_rcu_expedited() latency from all experiments:

------------------------------------------------------------------------
| | workqueues | kthread_worker | Diff |
------------------------------------------------------------------------
| Total No. of Experiments | 25 | 23 | |
------------------------------------------------------------------------
| Largest Maximum (ns) | 372,766,967 | 2,329,671 | -99.38% |
------------------------------------------------------------------------
| Smallest Maximum (ns) | 38,819 | 86,954 | 124.00% |
------------------------------------------------------------------------
| Range of Maximums (ns) | 372,728,148 | 2,242,717 | |
------------------------------------------------------------------------
| Largest Minimum (ns) | 88,623 | 27,588 | -68.87% |
------------------------------------------------------------------------
| Smallest Minimum (ns) | 326 | 447 | 37.12% |
------------------------------------------------------------------------
| Range of Minimums (ns) | 88,297 | 27,141 | |
------------------------------------------------------------------------

Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Reported-by: Tim Murray <timmurray@google.com>
Reported-by: Wei Wang <wvw@google.com>
Tested-by: Kyle Lin <kylelin@google.com>
Tested-by: Chunwei Lu <chunweilu@google.com>
Tested-by: Lulu Wang <luluw@google.com>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 70ae7b0c 14-Mar-2022 Frederic Weisbecker <frederic@kernel.org>

rcu: Fix preemption mode check on synchronize_rcu[_expedited]()

An early check on synchronize_rcu[_expedited]() tries to determine if
the current CPU is in UP mode on an SMP no-preempt kernel, in which case
there is no need to start a grace period since the current assumed
quiescent state is all we need.

However the preemption mode doesn't take into account the boot selected
preemption mode under CONFIG_PREEMPT_DYNAMIC=y, missing a possible
early return if the running flavour is "none" or "voluntary".

Use the shiny new preempt mode accessors to fix this. However,
avoid invoking them during early boot because doing so triggers a
WARN_ON_ONCE().

[ paulmck: Update for mainlined API. ]

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 75182a4e 02-Mar-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Add comments to final rcu_gp_cleanup() "if" statement

The final "if" statement in rcu_gp_cleanup() has proven to be rather
confusing, straightforward though it might have seemed when initially
written. This commit therefore adds comments to its "then" and "else"
clauses to at least provide a more elevated form of confusion.

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reported-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# c708b08c 23-Feb-2022 Paul E. McKenney <paulmck@kernel.org>

rcu: Check for jiffies going backwards

A report of a 12-jiffy normal RCU CPU stall warning raises interesting
questions about the nature of time on the offending system. This commit
instruments rcu_sched_clock_irq(), which is RCU's hook into the
scheduling-clock interrupt, checking for the jiffies counter going
backwards.

Reported-by: Saravanan D <sarvanand@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 99d6a2ac 04-Feb-2022 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Suppress debugging grace period delays during flooding

Tree RCU supports grace-period delays using the rcutree.gp_cleanup_delay,
rcutree.gp_init_delay, and rcutree.gp_preinit_delay kernel boot
parameters. These delays are strictly for debugging purposes, and have
proven quite effective at exposing bugs involving race with CPU-hotplug
operations. However, these delays can result in false positives when
used in conjunction with callback flooding, for example, those generated
by the rcutorture.fwd_progress kernel boot parameter.

This commit therefore suppresses grace-period delays while callback
flooding is in progress.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 5d900708 04-Mar-2022 Paul E. McKenney <paulmck@kernel.org>

rcu-tasks: Make Tasks RCU account for userspace execution

The main Tasks RCU quiescent state is voluntary context switch. However,
userspace execution is also a valid quiescent state, and is a valuable one
for userspace applications that spin repeatedly executing light-weight
non-sleeping system calls. Currently, such an application can delay a
Tasks RCU grace period for many tens of seconds.

This commit therefore enlists the aid of the scheduler-clock interrupt to
provide a Tasks RCU quiescent state when it interrupted a task executing
in userspace.

[ paulmck: Apply feedback from kernel test robot. ]

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Neil Spring <ntspring@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 87c5adf0 16-Feb-2022 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Initialize nocb kthreads only for boot CPU prior SMP initialization

The rcu_spawn_gp_kthread() function is called as an early initcall, which
means that SMP initialization hasn't happened yet and only the boot CPU is
online. Therefore, create only the NOCB kthreads related to the boot CPU.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 3352911f 16-Feb-2022 Frederic Weisbecker <frederic@kernel.org>

rcu: Initialize boost kthread only for boot node prior SMP initialization

The rcu_spawn_gp_kthread() function is called as an early initcall,
which means that SMP initialization hasn't happened yet and only the
boot CPU is online. Therefore, create only the boost kthread for the
leaf node of the boot CPU.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 2eed973a 16-Feb-2022 Frederic Weisbecker <frederic@kernel.org>

rcu: Assume rcu_init() is called before smp

The rcu_init() function is called way before SMP is initialized and
therefore only the boot CPU should be online at this stage.

Simplify the boot per-cpu initialization accordingly.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 29845399 08-Feb-2022 Frederic Weisbecker <frederic@kernel.org>

tick/rcu: Remove obsolete rcu_needs_cpu() parameters

With the removal of CONFIG_RCU_FAST_NO_HZ, the parameters in
rcu_needs_cpu() are not necessary anymore. Simply remove them.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>


# d818cc76 25-Dec-2021 Zqiang <qiang1.zhang@intel.com>

kasan: Record work creation stack trace with interrupts enabled

Recording the work creation stack trace for KASAN reports in
call_rcu() is expensive, due to unwinding the stack, but also
due to acquiring depot_lock inside stackdepot (which may be contended).
Because calling kasan_record_aux_stack_noalloc() does not require
interrupts to already be disabled, this may unnecessarily extend
the time with interrupts disabled.

Therefore, move calling kasan_record_aux_stack() before the section
with interrupts disabled.

Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 1fe09ebe 18-Dec-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Inline __call_rcu() into call_rcu()

Because __call_rcu() is invoked only by call_rcu(), this commit inlines
the former into the latter.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 218b957a 08-Dec-2021 David Woodhouse <dwmw@amazon.co.uk>

rcu: Add mutex for rcu boost kthread spawning and affinity setting

As we handle parallel CPU bringup, we will need to take care to avoid
spawning multiple boost threads, or race conditions when setting their
affinity. Spotted by Paul McKenney.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 5ae0f1b5 10-Dec-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Create and use an rcu_rdp_cpu_online()

The pattern "rdp->grpmask & rcu_rnp_online_cpus(rnp)" occurs frequently
in RCU code in order to determine whether rdp->cpu is online from an
RCU perspective. This commit therefore creates an rcu_rdp_cpu_online()
function to replace it.

[ paulmck: Apply kernel test robot unused-variable feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 80b3fd47 14-Dec-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_barrier() no longer block CPU-hotplug operations

This commit removes the cpus_read_lock() and cpus_read_unlock() calls
from rcu_barrier(), thus allowing CPUs to come and go during the course
of rcu_barrier() execution. Posting of the ->barrier_head callbacks does
synchronize with portions of RCU's CPU-hotplug notifiers, but these locks
are held for short time periods on both sides. Thus, full CPU-hotplug
operations could both start and finish during the execution of a given
rcu_barrier() invocation.

Additional synchronization is provided by a global ->barrier_lock.
Since the ->barrier_lock is only used during rcu_barrier() execution and
during onlining/offlining a CPU, the contention for this lock should
be low. It might be tempting to make use of a per-CPU lock just on
general principles, but straightforward attempts to do this have the
problems shown below.

Initial state: 3 CPUs present, CPU 0 and CPU1 do not have
any callback and CPU2 has callbacks.

1. CPU0 calls rcu_barrier().

2. CPU1 starts offlining for CPU2. CPU1 calls
rcutree_migrate_callbacks(). rcu_barrier_entrain() is called
from rcutree_migrate_callbacks(), with CPU2's rdp->barrier_lock.
It does not entrain ->barrier_head for CPU2, as rcu_barrier()
on CPU0 hasn't started the barrier sequence (by calling
rcu_seq_start(&rcu_state.barrier_sequence)) yet.

3. CPU0 starts new barrier sequence. It iterates over
CPU0 and CPU1, after acquiring their per-cpu ->barrier_lock
and finds 0 segcblist length. It updates ->barrier_seq_snap
for CPU0 and CPU1 and continues loop iteration to CPU2.

for_each_possible_cpu(cpu) {
raw_spin_lock_irqsave(&rdp->barrier_lock, flags);
if (!rcu_segcblist_n_cbs(&rdp->cblist)) {
WRITE_ONCE(rdp->barrier_seq_snap, gseq);
raw_spin_unlock_irqrestore(&rdp->barrier_lock, flags);
rcu_barrier_trace(TPS("NQ"), cpu, rcu_state.barrier_sequence);
continue;
}

4. rcutree_migrate_callbacks() completes execution on CPU1.
Segcblist len for CPU2 becomes 0.

5. The loop iteration on CPU0, checks rcu_segcblist_n_cbs(&rdp->cblist)
for CPU2 and completes the loop iteration after setting
->barrier_seq_snap.

6. As there isn't any ->barrier_head callback entrained; at
this point, rcu_barrier() in CPU0 returns.

7. The callbacks, which migrated from CPU2 to CPU1, execute.

Straightforward per-CPU locking is also subject to the following race
condition noted by Boqun Feng:

1. CPU0 calls rcu_barrier(), starting a new barrier sequence by invoking
rcu_seq_start() and init_completion(), but does not yet initialize
rcu_state.barrier_cpu_count.

2. CPU1 starts offlining for CPU2, calling rcutree_migrate_callbacks(),
which in turn calls rcu_barrier_entrain() holding CPU2's.
rdp->barrier_lock. It then entrains ->barrier_head for CPU2
and atomically increments rcu_state.barrier_cpu_count, which is
unfortunately not yet initialized to the value 2.

3. The just-entrained RCU callback is invoked. It atomically
decrements rcu_state.barrier_cpu_count and sees that it is
now zero. This callback therefore invokes complete().

4. CPU0 continues executing rcu_barrier(), but is not blocked
by its call to wait_for_completion(). This results in rcu_barrier()
returning before all pre-existing callbacks have been invoked,
which is a bug.

Therefore, synchronization is provided by rcu_state.barrier_lock,
which is also held across the initialization sequence, especially the
rcu_seq_start() and the atomic_set() that sets rcu_state.barrier_cpu_count
to the value 2. In addition, this lock is held when entraining the
rcu_barrier() callback, when deciding whether or not a CPU has callbacks
that rcu_barrier() must wait on, when setting the ->qsmaskinitnext for
incoming CPUs, and when migrating callbacks from a CPU that is going
offline.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a16578dd 14-Dec-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Rework rcu_barrier() and callback-migration logic

This commit reworks rcu_barrier() and callback-migration logic to
permit allowing rcu_barrier() to run concurrently with CPU-hotplug
operations. The key trick is for callback migration to check to see if
an rcu_barrier() is in flight, and, if so, enqueue the ->barrier_head
callback on its behalf.

This commit adds synchronization with RCU's CPU-hotplug notifiers. Taken
together, this will permit a later commit to remove the cpus_read_lock()
and cpus_read_unlock() calls from rcu_barrier().

[ paulmck: Updated per kbuild test robot feedback. ]
[ paulmck: Updated per reviews session with Neeraj, Frederic, Uladzislau, and Boqun. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 0cabb47a 10-Dec-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Refactor rcu_barrier() empty-list handling

This commit saves a few lines by checking first for an empty callback
list. If the callback list is empty, then that CPU is taken care of,
regardless of its online or nocb state. Also simplify tracing accordingly
and fold a few lines together.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 82980b16 16-Feb-2021 David Woodhouse <dwmw@amazon.co.uk>

rcu: Kill rnp->ofl_seq and use only rcu_state.ofl_lock for exclusion

If we allow architectures to bring APs online in parallel, then we end
up requiring rcu_cpu_starting() to be reentrant. But currently, the
manipulation of rnp->ofl_seq is not thread-safe.

However, rnp->ofl_seq is also fairly much pointless anyway since both
rcu_cpu_starting() and rcu_report_dead() hold rcu_state.ofl_lock for
fairly much the whole time that rnp->ofl_seq is set to an odd number
to indicate that an operation is in progress.

So drop rnp->ofl_seq completely, and use only rcu_state.ofl_lock.

This has a couple of minor complexities: lockdep will complain when we
take rcu_state.ofl_lock, and currently accepts the 'excuse' of having
an odd value in rnp->ofl_seq. So switch it to an arch_spinlock_t to
avoid that false positive complaint. Since we're killing rnp->ofl_seq
of course that 'excuse' has to be changed too, so make it check for
arch_spin_is_locked(rcu_state.ofl_lock).

There's no arch_spin_lock_irqsave() so we have to manually save and
restore local interrupts around the locking.

At Paul's request based on Neeraj's analysis, make rcu_gp_init not just
wait but *exclude* any CPU online/offline activity, which was fairly
much true already by virtue of it holding rcu_state.ofl_lock.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# c9515875 24-Jan-2022 Zqiang <qiang1.zhang@intel.com>

rcu: Add per-CPU rcuc task dumps to RCU CPU stall warnings

When the rcutree.use_softirq kernel boot parameter is set to zero, all
RCU_SOFTIRQ processing is carried out by the per-CPU rcuc kthreads.
If these kthreads are being starved, quiescent states will not be
reported, which in turn means that the grace period will not end, which
can in turn trigger RCU CPU stall warnings. This commit therefore dumps
stack traces of stalled CPUs' rcuc kthreads, which can help identify
what is preventing those kthreads from running.

Suggested-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Reviewed-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# c8b16a65 11-Jan-2022 Alison Chaiken <achaiken@aurora.tech>

rcu: Elevate priority of offloaded callback threads

When CONFIG_PREEMPT_RT=y, the rcutree.kthread_prio command-line
parameter signals initialization code to boost the priority of rcuc
callbacks to the designated value. With the additional
CONFIG_RCU_NOCB_CPU=y configuration and an additional rcu_nocbs
command-line parameter, the callbacks on the listed cores are
offloaded to new rcuop kthreads that are not pinned to the cores whose
post-grace-period work is performed. While the rcuop kthreads perform
the same function as the rcuc kthreads they offload, the kthread_prio
parameter only boosts the priority of the rcuc kthreads. Fix this
inconsistency by elevating rcuop kthreads to the same priority as the rcuc
kthreads.

Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# c8db27dd 11-Jan-2022 Alison Chaiken <achaiken@aurora.tech>

rcu: Move kthread_prio bounds-check to a separate function

Move the bounds-check of the kthread_prio cmdline parameter to a new
function in order to faciliate a different callsite.

Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 4b4399b2 28-Dec-2021 Zqiang <qiang1.zhang@intel.com>

rcu: Create per-cpu rcuc kthreads only when rcutree.use_softirq=0

The per-CPU "rcuc" kthreads are used only by kernels booted with
rcutree.use_softirq=0, but they are nevertheless unconditionally created
by kernels built with CONFIG_RCU_BOOST=y. This results in "rcuc"
kthreads being created that are never actually used. This commit
therefore refrains from creating these kthreads unless the kernel
is actually booted with rcutree.use_softirq=0.

Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 0598a4d4 18-Oct-2021 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Don't invoke local rcu core on callback overload from nocb kthread

rcu_core() tries to ensure that its self-invocation in case of callbacks
overload only happen in softirq/rcuc mode. Indeed it doesn't make sense
to trigger local RCU core from nocb_cb kthread since it can execute
on a CPU different from the target rdp. Also in case of overload, the
nocb_cb kthread simply iterates a new loop of callbacks processing.

However the "offloaded" check that aims at preventing misplaced
rcu_core() invocations is wrong. First of all that state is volatile
and second: softirq/rcuc can execute while the target rdp is offloaded.
As a result rcu_core() can be invoked on the wrong CPU while in the
process of (de-)offloading.

Fix that with moving the rcu_core() self-invocation to rcu_core() itself,
irrespective of the rdp offloaded state.

Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a554ba28 18-Oct-2021 Frederic Weisbecker <frederic@kernel.org>

rcu: Apply callbacks processing time limit only on softirq

Time limit only makes sense when callbacks are serviced in softirq mode
because:

_ In case we need to get back to the scheduler,
cond_resched_tasks_rcu_qs() is called after each callback.

_ In case some other softirq vector needs the CPU, the call to
local_bh_enable() before cond_resched_tasks_rcu_qs() takes care about
them via a call to do_softirq().

Therefore, make sure the time limit only applies to softirq mode.

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 3e61e95e 18-Oct-2021 Frederic Weisbecker <frederic@kernel.org>

rcu: Fix callbacks processing time limit retaining cond_resched()

The callbacks processing time limit makes sure we are not exceeding a
given amount of time executing the queue.

However its "continue" clause bypasses the cond_resched() call on
rcuc and NOCB kthreads, delaying it until we reach the limit, which can
be very long...

Make sure the scheduler has a higher priority than the time limit.

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 78ad37a2 18-Oct-2021 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Limit number of softirq callbacks only on softirq

The current condition to limit the number of callbacks executed in a
row checks the offloaded state of the rdp. Not only is it volatile
but it is also misleading: the rcu_core() may well be executing
callbacks concurrently with NOCB kthreads, and the offloaded state
would then be verified on both cases. As a result the limit would
spuriously not apply anymore on softirq while in the middle of
(de-)offloading process.

Fix and clarify the condition with those constraints in mind:

_ If callbacks are processed either by rcuc or NOCB kthread, the call
to cond_resched_tasks_rcu_qs() is enough to take care of the overload.

_ If instead callbacks are processed by softirqs:
* If need_resched(), exit the callbacks processing
* Otherwise if CPU is idle we can continue
* Otherwise exit because a softirq shouldn't interrupt a task for too
long nor deprive other pending softirq vectors of the CPU.

Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 7b65dfa3 18-Oct-2021 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Use appropriate rcu_nocb_lock_irqsave()

Instead of hardcoding IRQ save and nocb lock, use the consolidated
API (and fix a comment as per Valentin Schneider's suggestion).

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 344e219d 18-Oct-2021 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Check a stable offloaded state to manipulate qlen_last_fqs_check

It's not entirely obvious why rdp->qlen_last_fqs_check is updated before
processing the queue only on offloaded rdp. There can be different
effect to that, either in favour of triggering the force quiescent state
path or not. For example:

1) If the number of callbacks has decreased since the last
rdp->qlen_last_fqs_check update (because we recently called
rcu_do_batch() and we executed below qhimark callbacks) and the number
of processed callbacks on a subsequent do_batch() arranges for
exceeding qhimark on non-offloaded but not on offloaded setup, then we
may spare a later run to the force quiescent state
slow path on __call_rcu_nocb_wake(), as compared to the non-offloaded
counterpart scenario.

Here is such an offloaded scenario instance:

qhimark = 1000
rdp->last_qlen_last_fqs_check = 3000
rcu_segcblist_n_cbs(rdp) = 2000

rcu_do_batch() {
if (offloaded)
rdp->last_qlen_fqs_check = rcu_segcblist_n_cbs(rdp) // 2000
// run 1000 callback
rcu_segcblist_n_cbs(rdp) = 1000
// Not updating rdp->qlen_last_fqs_check
if (count < rdp->qlen_last_fqs_check - qhimark)
rdp->qlen_last_fqs_check = count;
}

call_rcu() * 1001 {
__call_rcu_nocb_wake() {
// not taking the fqs slowpath:
// rcu_segcblist_n_cbs(rdp) == 2001
// rdp->qlen_last_fqs_check == 2000
// qhimark == 1000
if (len > rdp->qlen_last_fqs_check + qhimark)
...
}

In the case of a non-offloaded scenario, rdp->qlen_last_fqs_check
would be 1000 and the fqs slowpath would have executed.

2) If the number of callbacks has increased since the last
rdp->qlen_last_fqs_check update (because we recently queued below
qhimark callbacks) and the number of callbacks executed in rcu_do_batch()
doesn't exceed qhimark for either offloaded or non-offloaded setup,
then it's possible that the offloaded scenario later run the force
quiescent state slow path on __call_rcu_nocb_wake() while the
non-offloaded doesn't.

qhimark = 1000
rdp->last_qlen_last_fqs_check = 3000
rcu_segcblist_n_cbs(rdp) = 2000

rcu_do_batch() {
if (offloaded)
rdp->last_qlen_last_fqs_check = rcu_segcblist_n_cbs(rdp) // 2000
// run 100 callbacks
// concurrent queued 100
rcu_segcblist_n_cbs(rdp) = 2000
// Not updating rdp->qlen_last_fqs_check
if (count < rdp->qlen_last_fqs_check - qhimark)
rdp->qlen_last_fqs_check = count;
}

call_rcu() * 1001 {
__call_rcu_nocb_wake() {
// Taking the fqs slowpath:
// rcu_segcblist_n_cbs(rdp) == 3001
// rdp->qlen_last_fqs_check == 2000
// qhimark == 1000
if (len > rdp->qlen_last_fqs_check + qhimark)
...
}

In the case of a non-offloaded scenario, rdp->qlen_last_fqs_check
would be 3000 and the fqs slowpath would have executed.

The reason for updating rdp->qlen_last_fqs_check when invoking callbacks
for offloaded CPUs is that there is usually no point in waking up either
the rcuog or rcuoc kthreads while in this state. After all, both threads
are prohibited from indefinite sleeps.

The exception is when some huge number of callbacks are enqueued while
rcu_do_batch() is in the midst of invoking, in which case interrupting
the rcuog kthread's timed sleep might get more callbacks set up for the
next grace period.

Reported-and-tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Original-patch-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# b3bb02fe 18-Oct-2021 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Make rcu_core() callbacks acceleration (de-)offloading safe

When callbacks are offloaded, the NOCB kthreads handle the callbacks
progression on behalf of rcu_core().

However during the (de-)offloading process, the kthread may not be
entirely up to the task. As a result some callbacks grace period
sequence number may remain stale for a while because rcu_core() won't
take care of them either.

Fix this with forcing callbacks acceleration from rcu_core() as long
as the offloading process isn't complete.

Reported-and-tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 24ee940d 18-Oct-2021 Thomas Gleixner <tglx@linutronix.de>

rcu/nocb: Make rcu_core() callbacks acceleration preempt-safe

While reporting a quiescent state for a given CPU, rcu_core() takes
advantage of the freshly loaded grace period sequence number and the
locked rnp to accelerate the callbacks whose sequence number have been
assigned a stale value.

This action is only necessary when the rdp isn't offloaded, otherwise
the NOCB kthreads already take care of the callbacks progression.

However the check for the offloaded state is volatile because it is
performed outside the IRQs disabled section. It's possible for the
offloading process to preempt rcu_core() at that point on PREEMPT_RT.

This is dangerous because rcu_core() may end up accelerating callbacks
concurrently with NOCB kthreads without appropriate locking.

Fix this with moving the offloaded check inside the rnp locking section.

Reported-and-tested-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# fbb94cbd 18-Oct-2021 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Invoke rcu_core() at the start of deoffloading

On PREEMPT_RT, if rcu_core() is preempted by the de-offloading process,
some work, such as callbacks acceleration and invocation, may be left
unattended due to the volatile checks on the offloaded state.

In the worst case this work is postponed until the next rcu_pending()
check that can take a jiffy to reach, which can be a problem in case
of callbacks flooding.

Solve that with invoking rcu_core() early in the de-offloading process.
This way any work dismissed by an ongoing rcu_core() call fooled by
a preempting deoffloading process will be caught up by a nearby future
recall to rcu_core(), this time fully aware of the de-offloading state.

Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 213d56bf 18-Oct-2021 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Prepare state machine for a new step

Currently SEGCBLIST_SOFTIRQ_ONLY is a bit of an exception among the
segcblist flags because it is an exclusive state that doesn't mix up
with the other flags. Remove it in favour of:

_ A flag specifying that rcu_core() needs to perform callbacks execution
and acceleration

and

_ A flag specifying we want the nocb lock to be held in any needed
circumstances

This clarifies the code and is more flexible: It allows to have a state
where rcu_core() runs with locking while offloading hasn't started yet.
This is a necessary step to prepare for triggering rcu_core() at the
very beginning of the de-offloading process so that rcu_core() won't
dismiss work while being preempted by the de-offloading process, at
least not without a pending subsequent rcu_core() that will quickly
catch up.

Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 614ddad1 17-Sep-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Tighten rcu_advance_cbs_nowake() checks

Currently, rcu_advance_cbs_nowake() checks that a grace period is in
progress, however, that grace period could end just after the check.
This commit rechecks that a grace period is still in progress while
holding the rcu_node structure's lock. The grace period cannot end while
the current CPU's rcu_node structure's ->lock is held, thus avoiding
false positives from the WARN_ON_ONCE().

As Daniel Vacek noted, it is not necessary for the rcu_node structure
to have a CPU that has not yet passed through its quiescent state.

Tested-by: Guillaume Morin <guillaume@morinfr.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 790da248 29-Sep-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Make idle entry report expedited quiescent states

In non-preemptible kernels, an unfortunately timed expedited grace period
can result in the rcu_exp_handler() IPI handler setting the rcu_data
structure's cpu_no_qs.b.exp field just as the target CPU enters idle.
There are situations in which this field will not be checked until after
that CPU exits idle. The resulting grace-period latency does not qualify
as "expedited".

This commit therefore checks this field upon non-preemptible idle entry in
the rcu_preempt_deferred_qs() function. It also qualifies the rcu_core()
preempt_count() check with IS_ENABLED(CONFIG_PREEMPT_COUNT) to prevent
false-positive quiescent states from count-free kernels.

Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 300c0c5e 15-Nov-2021 Jun Miao <jun.miao@intel.com>

rcu: Avoid alloc_pages() when recording stack

The default kasan_record_aux_stack() calls stack_depot_save() with GFP_NOWAIT,
which in turn can then call alloc_pages(GFP_NOWAIT, ...). In general, however,
it is not even possible to use either GFP_ATOMIC nor GFP_NOWAIT in certain
non-preemptive contexts/RT kernel including raw_spin_locks (see gfp.h and ab00db216c9c7).
Fix it by instructing stackdepot to not expand stack storage via alloc_pages()
in case it runs out by using kasan_record_aux_stack_noalloc().

Jianwei Hu reported:
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:969
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 15319, name: python3
INFO: lockdep is turned off.
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590
softirqs last enabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 6 PID: 15319 Comm: python3 Tainted: G W O 5.15-rc7-preempt-rt #1
Hardware name: Supermicro SYS-E300-9A-8C/A2SDi-8C-HLN4F, BIOS 1.1b 12/17/2018
Call Trace:
show_stack+0x52/0x58
dump_stack+0xa1/0xd6
___might_sleep.cold+0x11c/0x12d
rt_spin_lock+0x3f/0xc0
rmqueue+0x100/0x1460
rmqueue+0x100/0x1460
mark_usage+0x1a0/0x1a0
ftrace_graph_ret_addr+0x2a/0xb0
rmqueue_pcplist.constprop.0+0x6a0/0x6a0
__kasan_check_read+0x11/0x20
__zone_watermark_ok+0x114/0x270
get_page_from_freelist+0x148/0x630
is_module_text_address+0x32/0xa0
__alloc_pages_nodemask+0x2f6/0x790
__alloc_pages_slowpath.constprop.0+0x12d0/0x12d0
create_prof_cpu_mask+0x30/0x30
alloc_pages_current+0xb1/0x150
stack_depot_save+0x39f/0x490
kasan_save_stack+0x42/0x50
kasan_save_stack+0x23/0x50
kasan_record_aux_stack+0xa9/0xc0
__call_rcu+0xff/0x9c0
call_rcu+0xe/0x10
put_object+0x53/0x70
__delete_object+0x7b/0x90
kmemleak_free+0x46/0x70
slab_free_freelist_hook+0xb4/0x160
kfree+0xe5/0x420
kfree_const+0x17/0x30
kobject_cleanup+0xaa/0x230
kobject_put+0x76/0x90
netdev_queue_update_kobjects+0x17d/0x1f0
... ...
ksys_write+0xd9/0x180
__x64_sys_write+0x42/0x50
do_syscall_64+0x38/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Links: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/linux/kasan.h?id=7cb3007ce2da27ec02a1a3211941e7fe6875b642
Fixes: 84109ab58590 ("rcu: Record kvfree_call_rcu() call stack for KASAN")
Fixes: 26e760c9a7c8 ("rcu: kasan: record and print call_rcu() call stack")
Reported-by: Jianwei Hu <jianwei.hu@windriver.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Marco Elver <elver@google.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Jun Miao <jun.miao@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 2407a64f 27-Sep-2021 Changbin Du <changbin.du@intel.com>

rcu: in_irq() cleanup

This commit replaces the obsolete and ambiguous macro in_irq() with its
shiny new in_hardirq() equivalent.

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# bc849e91 27-Sep-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Move rcu_needs_cpu() to tree.c

Now that RCU_FAST_NO_HZ is no more, there is but one implementation of
the rcu_needs_cpu() function. This commit therefore moves this function
from kernel/rcu/tree_plugin.c to kernel/rcu/tree.c.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# e2c73a68 27-Sep-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove the RCU_FAST_NO_HZ Kconfig option

All of the uses of CONFIG_RCU_FAST_NO_HZ=y that I have seen involve
systems with RCU callbacks offloaded. In this situation, all that this
Kconfig option does is slow down idle entry/exit with an additional
allways-taken early exit. If this is the only use case, then this
Kconfig option nothing but an attractive nuisance that needs to go away.

This commit therefore removes the RCU_FAST_NO_HZ Kconfig option.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 74aece72 28-Sep-2021 Peter Zijlstra <peterz@infradead.org>

rcu: Fix rcu_dynticks_curr_cpu_in_eqs() vs noinstr

vmlinux.o: warning: objtool: rcu_nmi_enter()+0x36: call to __kasan_check_read() leaves .noinstr.text section

noinstr cannot have atomic_*() functions in because they're explicitly
annotated, use arch_atomic_*().

Fixes: 2be57f732889 ("rcu: Weaken ->dynticks accesses and updates")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 4aa846f9 29-Jul-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcutree_dying_cpu() use its "cpu" parameter

The CPU-hotplug functions take a "cpu" parameter, but rcutree_dying_cpu()
ignores it in favor of this_cpu_ptr(). This works at the moment, but
it would be better to be consistent. This might also work better given
some possible future changes. This commit therefore uses per_cpu_ptr()
to avoid ignoring the rcutree_dying_cpu() function's argument.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 768f5d50 29-Jul-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Simplify rcu_report_dead() call to rcu_report_exp_rdp()

Currently, rcu_report_dead() disables preemption across its call to
rcu_report_exp_rdp(), but this is pointless because interrupts are
already disabled by the caller. In addition, rcu_report_dead() computes
the address of the outgoing CPU's rcu_data structure, which is also
pointless because this address is already present in local variable rdp.
This commit therefore drops the preemption disabling and passes rdp
to rcu_report_exp_rdp().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 2caebefb 28-Jul-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()

The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
counter of an incoming CPU when required. It is currently invoked
from rcutree_prepare_cpu(), which runs before the incoming CPU is
running, and thus on some other CPU. This makes the per-CPU accesses in
rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
the running CPU cannot possibly be in dyntick-idle mode, which means
that rcu_dynticks_eqs_online() never has any effect.

It is currently OK for rcu_dynticks_eqs_online() to have no effect, but
only because the CPU-offline process just happens to leave ->dynticks in
the correct state. After all, if ->dynticks were in the wrong state on a
just-onlined CPU, rcutorture would complain bitterly the next time that
CPU went idle, at least in kernels built with CONFIG_RCU_EQS_DEBUG=y,
for example, those built by rcutorture scenario TREE04. One could
argue that this means that rcu_dynticks_eqs_online() is unnecessary,
however, removing it would make the CPU-online process vulnerable to
slight changes in the CPU-offline process.

One could also ask why it is safe to move the rcu_dynticks_eqs_online()
call so late in the CPU-online process. Indeed, there was a time when it
would not have been safe, which does much to explain its current location.
However, the marking of a CPU as online from an RCU perspective has long
since moved from rcutree_prepare_cpu() to rcu_cpu_starting(), and all
that is required is that ->dynticks be set correctly by the time that
the CPU is marked as online from an RCU perspective. After all, the RCU
grace-period kthread does not check to see if offline CPUs are also idle.
(In case you were curious, this is one reason why there is quiescent-state
reporting as part of the offlining process.)

This commit therefore moves the call to rcu_dynticks_eqs_online() from
rcutree_prepare_cpu() to rcu_cpu_starting(), this latter being guaranteed
to be running on the incoming CPU. The call to this function must of
course be placed before this rcu_cpu_starting() announces this CPU's
presence to RCU.

Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# ebc88ad4 26-Jul-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Comment rcu_gp_init() code waiting for CPU-hotplug operations

Near the beginning of rcu_gp_init() is a per-rcu_node loop that waits
for CPU-hotplug operations that might have started before the new
grace period did. This commit adds a comment explaining that this
wait does not exclude CPU-hotplug operations.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 9424b867 22-Jul-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate rcu_implicit_dynticks_qs() local variable ruqp

The rcu_implicit_dynticks_qs() function's local variable ruqp references
the ->rcu_urgent_qs field in the rcu_data structure referenced by the
function parameter rdp, with a rather odd method for computing the
pointer to this field. This commit therefore simplifies things and
saves a couple of lines of code by replacing each instance of ruqp with
&rdp->need_heavy_qs.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 88ee23ef 22-Jul-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate rcu_implicit_dynticks_qs() local variable rnhqp

The rcu_implicit_dynticks_qs() function's local variable rnhqp references
the ->rcu_need_heavy_qs field in the rcu_data structure referenced by
the function parameter rdp, with a rather odd method for computing
the pointer to this field. This commit therefore simplifies things
and saves a few lines of code by replacing each instance of rnhqp with
&rdp->need_heavy_qs.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 2431774f 20-Jul-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Mark accesses to rcu_state.n_force_qs

This commit marks accesses to the rcu_state.n_force_qs. These data
races are hard to make happen, but syzkaller was equal to the task.

Reported-by: syzbot+e08a83a1940ec3846cd5@syzkaller.appspotmail.com
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# d3dd95a8 03-Aug-2021 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

rcu: Replace deprecated CPU-hotplug functions

The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: rcu@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 8211e922 30-Jun-2021 Liu Song <liu.song11@zte.com.cn>

rcu: Use per_cpu_ptr to get the pointer of per_cpu variable

There are a few remaining locations in kernel/rcu that still use
"&per_cpu()". This commit replaces them with "per_cpu_ptr(&)", and does
not introduce any functional change.

Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# eb880949 29-Jun-2021 Liu Song <liu.song11@zte.com.cn>

rcu: Remove useless "ret" update in rcu_gp_fqs_loop()

Within rcu_gp_fqs_loop(), the "ret" local variable is set to the
return value from swait_event_idle_timeout_exclusive(), but "ret" is
unconditionally overwritten later in the code. This commit therefore
removes this useless assignment.

Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# f74126dc 07-Jun-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_gp_init() and rcu_gp_fqs_loop noinline to conserve stack

The kbuild test project found an oversized stack frame in rcu_gp_kthread()
for some kernel configurations. This oversizing was due to a very large
amount of inlining, which is unnecessary due to the fact that this code
executes infrequently. This commit therefore marks rcu_gp_init() and
rcu_gp_fqs_loop noinline_for_stack to conserve stack space.

Reported-by: kernel test robot <lkp@intel.com>
Tested-by: Rong Chen <rong.a.chen@intel.com>
[ paulmck: noinline_for_stack per Nathan Chancellor. ]
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 2be57f73 19-May-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Weaken ->dynticks accesses and updates

Accesses to the rcu_data structure's ->dynticks field have always been
fully ordered because it was not possible to prove that weaker ordering
was safe. However, with the removal of the rcu_eqs_special_set() function
and the advent of the Linux-kernel memory model, it is now easy to show
that two of the four original full memory barriers can be weakened to
acquire and release operations. The remaining pair must remain full
memory barriers. This change makes the memory ordering requirements
more evident, and it might well also speed up the to-idle and from-idle
fastpaths on some architectures.

The following litmus test, adapted from one supplied off-list by Frederic
Weisbecker, models the RCU grace-period kthread detecting an idle CPU
that is concurrently transitioning to non-idle:

C dynticks-from-idle

{
DYNTICKS=0; (* Initially idle. *)
}

P0(int *X, int *DYNTICKS)
{
int dynticks;
int x;

// Idle.
dynticks = READ_ONCE(*DYNTICKS);
smp_store_release(DYNTICKS, dynticks + 1);
smp_mb();
// Now non-idle
x = READ_ONCE(*X);
}

P1(int *X, int *DYNTICKS)
{
int dynticks;

WRITE_ONCE(*X, 1);
smp_mb();
dynticks = smp_load_acquire(DYNTICKS);
}

exists (1:dynticks=0 /\ 0:x=1)

Running "herd7 -conf linux-kernel.cfg dynticks-from-idle.litmus" verifies
this transition, namely, showing that if the RCU grace-period kthread (P1)
sees another CPU as idle (P0), then any memory access prior to the start
of the grace period (P1's write to X) will be seen by any RCU read-side
critical section following the to-non-idle transition (P0's read from X).
This is a straightforward use of full memory barriers to force ordering
in a store-buffering (SB) litmus test.

The following litmus test, also adapted from the one supplied off-list
by Frederic Weisbecker, models the RCU grace-period kthread detecting
a non-idle CPU that is concurrently transitioning to idle:

C dynticks-into-idle

{
DYNTICKS=1; (* Initially non-idle. *)
}

P0(int *X, int *DYNTICKS)
{
int dynticks;

// Non-idle.
WRITE_ONCE(*X, 1);
dynticks = READ_ONCE(*DYNTICKS);
smp_store_release(DYNTICKS, dynticks + 1);
smp_mb();
// Now idle.
}

P1(int *X, int *DYNTICKS)
{
int x;
int dynticks;

smp_mb();
dynticks = smp_load_acquire(DYNTICKS);
x = READ_ONCE(*X);
}

exists (1:dynticks=2 /\ 1:x=0)

Running "herd7 -conf linux-kernel.cfg dynticks-into-idle.litmus" verifies
this transition, namely, showing that if the RCU grace-period kthread
(P1) sees another CPU as newly idle (P0), then any pre-idle memory access
(P0's write to X) will be seen by any code following the grace period
(P1's read from X). This is a simple release-acquire pair forcing
ordering in a message-passing (MP) litmus test.

Of course, if the grace-period kthread detects the CPU as non-idle,
it will refrain from reporting a quiescent state on behalf of that CPU,
so there are no ordering requirements from the grace-period kthread in
that case. However, other subsystems call rcu_is_idle_cpu() to check
for CPUs being non-idle from an RCU perspective. That case is also
verified by the above litmus tests with the proviso that the sense of
the low-order bit of the DYNTICKS counter be inverted.

Unfortunately, on x86 smp_mb() is as expensive as a cache-local atomic
increment. This commit therefore weakens only the read from ->dynticks.
However, the updates are abstracted into a rcu_dynticks_inc() function
to ease any future changes that might be needed.

[ paulmck: Apply Linus Torvalds feedback. ]

Link: https://lore.kernel.org/lkml/20210721202127.2129660-4-paulmck@kernel.org/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a86baa69 18-May-2021 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Remove special bit at the bottom of the ->dynticks counter

Commit b8c17e6664c4 ("rcu: Maintain special bits at bottom of ->dynticks
counter") reserved a bit at the bottom of the ->dynticks counter to defer
flushing of TLBs, but this facility never has been used. This commit
therefore removes this capability along with the rcu_eqs_special_set()
function used to trigger it.

Link: https://lore.kernel.org/linux-doc/CALCETrWNPOOdTrFabTDd=H7+wc6xJ9rJceg6OL1S0rTV5pfSsA@mail.gmail.com/
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
[ paulmck: Forward-port to v5.13-rc1. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# cba712be 18-May-2021 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Remove NOCB deferred wakeup from rcutree_dead_cpu()

At CPU offline time, we must handle any pending wakeup for the nocb_gp
kthread linked to the outgoing CPU.

Now we are making sure of that twice:

1) From rcu_report_dead() when the outgoing CPU makes the very last
local cleanups by itself before switching offline.

2) From rcutree_dead_cpu(). Here the offlining CPU has gone and is truly
now offline. Another CPU takes care of post-portem cleaning up and
check if the offline CPU had pending wakeup.

Both ways are fine but we have to choose one or the other because we
don't need to repeat that action. Simply benefit from cache locality
and keep only the first solution.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# dfcb2754 18-May-2021 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Start moving nocb code to its own plugin file

The kernel/rcu/tree_plugin.h file contains not only the plugins for
preemptible RCU, but also many other features including rcu_nocbs
callback offloading. This offloading has become large and complex,
so it is time to put it in its own file.

This commit starts that process.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
[ paulmck: Rename to tree_nocb.h, add Frederic as author. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# cf868c2a 24-Mar-2021 Paul E. McKenney <paulmck@kernel.org>

rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states

Heavy networking load can cause a CPU to execute continuously and
indefinitely within ksoftirqd, in which case there will be no voluntary
task switches and thus no RCU-tasks quiescent states. This commit
therefore causes the exiting rcu_softirq_qs() to provide an RCU-tasks
quiescent state.

This of course means that __do_softirq() and its callers cannot be
invoked from within a tracing trampoline.

Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>


# 1893afd6 29-Apr-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Improve comments describing RCU read-side critical sections

There are a number of places that call out the fact that preempt-disable
regions of code now act as RCU read-side critical sections, where
preempt-disable regions of code include irq-disable regions of code,
bh-disable regions of code, hardirq handlers, and NMI handlers. However,
someone relying solely on (for example) the call_rcu() header comment
might well have no idea that preempt-disable regions of code have RCU
semantics.

This commit therefore updates the header comments for
call_rcu(), synchronize_rcu(), rcu_dereference_bh_check(), and
rcu_dereference_sched_check() to call out these new(ish) forms of RCU
readers.

Reported-by: Michel Lespinasse <michel@lespinasse.org>
[ paulmck: Apply Matthew Wilcox and Michel Lespinasse feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a616aec9 22-Mar-2021 Ingo Molnar <mingo@kernel.org>

rcu: Fix various typos in comments

Fix ~12 single-word typos in RCU code comments.

[ paulmck: Apply feedback from Randy Dunlap. ]
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 87090516 22-Feb-2021 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Prepare for fine-grained deferred wakeup

Tuning the deferred wakeup level must be done from a safe wakeup
point. Currently those sites are:

* ->nocb_timer
* user/idle/guest entry
* CPU down
* softirq/rcuc

All of these sites perform the wake up for both RCU_NOCB_WAKE and
RCU_NOCB_WAKE_FORCE.

In order to merge ->nocb_timer and ->nocb_bypass_timer together, we plan
to add a new RCU_NOCB_WAKE_BYPASS that really should be deferred until
a timer fires so that we don't wake up the NOCB-gp kthread too early.

To prepare for that, this commit specifies the per-callsite wakeup
level/limit.

Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
[ paulmck: Fix non-NOCB rcu_nocb_need_deferred_wakeup() definition. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 3d3a0d1b 16-Apr-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Point to documentation of ordering guarantees

Add comments to synchronize_rcu() and friends that point to
Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 2f20de99 11-Apr-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_gp_cleanup() be noinline for tracing

Although there are trace events for RCU grace periods, these are only
enabled in CONFIG_RCU_TRACE=y kernels. This commit therefore marks
rcu_gp_cleanup() noinline in order to provide a function that can be
traced that is invoked near the end of each grace period.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 3ef5a1c3 05-Apr-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Make RCU priority boosting work on single-CPU rcu_node structures

When any CPU comes online, it checks to see if an RCU-boost kthread has
already been created for that CPU's leaf rcu_node structure, and if
not, it creates one. Unfortunately, it also verifies that this leaf
rcu_node structure actually has at least one online CPU, and if not,
it declines to create the kthread. Although this behavior makes sense
during early boot, especially on systems that claim far more CPUs than
they actually have, it makes no sense for the first CPU to come online
for a given rcu_node structure. There is no point in checking because
we know there is a CPU on its way in.

The problem is that timing differences can cause this incoming CPU to not
yet be reflected in the various bit masks even at rcutree_online_cpu()
time, and there is no chance at rcutree_prepare_cpu() time. Plus it
would be better to create the RCU-boost kthread at rcutree_prepare_cpu()
to handle the case where the CPU is involved in an RCU priority inversion
very shortly after it comes online.

This commit therefore moves the checking to rcu_prepare_kthreads(), which
is called only at early boot, when the check is appropriate. In addition,
it makes rcutree_prepare_cpu() invoke rcu_spawn_one_boost_kthread(), which
no longer does any checking for online CPUs.

With this change, RCU priority boosting tests now pass for short rcutorture
runs, even with single-CPU leaf rcu_node structures.

Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 8e4b1d2b 31-Mar-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread()

Currently, rcu_spawn_core_kthreads() is invoked via an early_initcall(),
which works, except that rcu_spawn_gp_kthread() is also invoked via an
early_initcall() and rcu_spawn_core_kthreads() relies on adjustments to
kthread_prio that are carried out by rcu_spawn_gp_kthread(). There is
no guaranttee of ordering among early_initcall() handlers, and thus no
guarantee that kthread_prio will be properly checked and range-limited
at the time that rcu_spawn_core_kthreads() needs it.

In most cases, this bug is harmless. After all, the only reason that
rcu_spawn_gp_kthread() adjusts the value of kthread_prio is if the user
specified a nonsensical value for this boot parameter, which experience
indicates is rare.

Nevertheless, a bug is a bug. This commit therefore causes the
rcu_spawn_core_kthreads() function to be invoked directly from
rcu_spawn_gp_kthread() after any needed adjustments to kthread_prio have
been carried out.

Fixes: 48d07c04b4cc ("rcu: Enable elimination of Tree-RCU softirq processing")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 277ffe1b 30-Mar-2021 Zhouyi Zhou <zhouzhouyi@gmail.com>

rcu: Improve tree.c comments and add code cleanups

This commit cleans up some comments and code in kernel/rcu/tree.c.

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# ce7c169d 30-Mar-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove the unused rcu_irq_exit_preempt() function

Commit 9ee01e0f69a9 ("x86/entry: Clean up idtentry_enter/exit()
leftovers") left the rcu_irq_exit_preempt() in place in order to avoid
conflicts with the -rcu tree. Now that this change has long since hit
mainline, this commit removes the no-longer-used rcu_irq_exit_preempt()
function.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# b5befe84 17-Apr-2021 Frederic Weisbecker <frederic@kernel.org>

srcu: Fix broken node geometry after early ssp init

An srcu_struct structure that is initialized before rcu_init_geometry()
will have its srcu_node hierarchy based on CONFIG_NR_CPUS. Once
rcu_init_geometry() is called, this hierarchy is compressed as needed
for the actual maximum number of CPUs for this system.

Later on, that srcu_struct structure is confused, sometimes referring
to its initial CONFIG_NR_CPUS-based hierarchy, and sometimes instead
to the new num_possible_cpus() hierarchy. For example, each of its
->mynode fields continues to reference the original leaf rcu_node
structures, some of which might no longer exist. On the other hand,
srcu_for_each_node_breadth_first() traverses to the new node hierarchy.

There are at least two bad possible outcomes to this:

1) a) A callback enqueued early on an srcu_data structure (call it
*sdp) is recorded pending on sdp->mynode->srcu_data_have_cbs in
srcu_funnel_gp_start() with sdp->mynode pointing to a deep leaf
(say 3 levels).

b) The grace period ends after rcu_init_geometry() shrinks the
nodes level to a single one. srcu_gp_end() walks through the new
srcu_node hierarchy without ever reaching the old leaves so the
callback is never executed.

This is easily reproduced on an 8 CPUs machine with CONFIG_NR_CPUS >= 32
and "rcupdate.rcu_self_test=1". The srcu_barrier() after early tests
verification never completes and the boot hangs:

[ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 seconds.
[ 5413.147564] Not tainted 5.12.0-rc4+ #28
[ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5413.159753] task:swapper/0 state:D stack: 0 pid: 1 ppid: 0 flags:0x00004000
[ 5413.168099] Call Trace:
[ 5413.170555] __schedule+0x36c/0x930
[ 5413.174057] ? wait_for_completion+0x88/0x110
[ 5413.178423] schedule+0x46/0xf0
[ 5413.181575] schedule_timeout+0x284/0x380
[ 5413.185591] ? wait_for_completion+0x88/0x110
[ 5413.189957] ? mark_held_locks+0x61/0x80
[ 5413.193882] ? mark_held_locks+0x61/0x80
[ 5413.197809] ? _raw_spin_unlock_irq+0x24/0x50
[ 5413.202173] ? wait_for_completion+0x88/0x110
[ 5413.206535] wait_for_completion+0xb4/0x110
[ 5413.210724] ? srcu_torture_stats_print+0x110/0x110
[ 5413.215610] srcu_barrier+0x187/0x200
[ 5413.219277] ? rcu_tasks_verify_self_tests+0x50/0x50
[ 5413.224244] ? rdinit_setup+0x2b/0x2b
[ 5413.227907] rcu_verify_early_boot_tests+0x2d/0x40
[ 5413.232700] do_one_initcall+0x63/0x310
[ 5413.236541] ? rdinit_setup+0x2b/0x2b
[ 5413.240207] ? rcu_read_lock_sched_held+0x52/0x80
[ 5413.244912] kernel_init_freeable+0x253/0x28f
[ 5413.249273] ? rest_init+0x250/0x250
[ 5413.252846] kernel_init+0xa/0x110
[ 5413.256257] ret_from_fork+0x22/0x30

2) An srcu_struct structure that is initialized before rcu_init_geometry()
and used afterward will always have stale rdp->mynode references,
resulting in callbacks to be missed in srcu_gp_end(), just like in
the previous scenario.

This commit therefore causes init_srcu_struct_nodes to initialize the
geometry, if needed. This ensures that the srcu_node hierarchy is
properly built and distributed from the get-go.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 8e9c01c7 08-Apr-2021 Frederic Weisbecker <frederic@kernel.org>

srcu: Initialize SRCU after timers

Once srcu_init() is called, the SRCU core will make use of delayed
workqueues, which rely on timers. However init_timers() is called
several steps after rcu_init(). This means that a call_srcu() after
rcu_init() but before init_timers() would find itself within a dangerously
uninitialized timer core.

This commit therefore creates a separate call to srcu_init() after
init_timer() completes, which ensures that we stay in early SRCU mode
until timers are safe(r).

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a78d4a2a 21-Apr-2021 Uladzislau Rezki (Sony) <urezki@gmail.com>

kvfree_rcu: Refactor kfree_rcu_monitor()

Currently we have three functions which depend on each other.
Two of them are quite tiny and the last one where the most
work is done. All of them are related to queuing RCU batches
to reclaim objects after a GP.

1. kfree_rcu_monitor(). It consist of few lines. It acquires a spin-lock
and calls kfree_rcu_drain_unlock().

2. kfree_rcu_drain_unlock(). It also consists of few lines of code. It
calls queue_kfree_rcu_work() to queue the batch. If this fails,
it rearms the monitor work to try again later.

3. queue_kfree_rcu_work(). This provides the bulk of the functionality,
attempting to start a new batch to free objects after a GP.

Since there are no external users of functions [2] and [3], both
can eliminated by moving all logic directly into [1], which both
shrinks and simplifies the code.

Also replace comments which start with "/*" to "//" format to make it
unified across the file.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# d8628f35 28-Apr-2021 Uladzislau Rezki (Sony) <urezki@gmail.com>

kvfree_rcu: Fix comments according to current code

The kvfree_rcu() function now defers allocations in the common
case due to the fact that there is no lockless access to the
memory-allocator caches/pools. In addition, in CONFIG_PREEMPT_NONE=y
and in CONFIG_PREEMPT_VOLUNTARY=y kernels, there is no reliable way to
determine if spinlocks are held. As a result, allocation is deferred in
the common case, and the two-argument form of kvfree_rcu() thus uses the
"channel 3" queue through all the rcu_head structures. This channel
is called referred to as the emergency case in comments, and these
comments are now obsolete.

This commit therefore updates these comments to reflect the new
common-case nature of such emergencies.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 7fe1da33 15-Apr-2021 Uladzislau Rezki (Sony) <urezki@gmail.com>

kvfree_rcu: Use kfree_rcu_monitor() instead of open-coded variant

Replace an open-coded version of the kfree_rcu_monitor() function body
with a call to that function.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# dd28c9f0 15-Apr-2021 Uladzislau Rezki (Sony) <urezki@gmail.com>

kvfree_rcu: Update "monitor_todo" once a batch is started

Before attempting to start a new batch the "monitor_todo" variable is
set to "false" and set back to "true" when a previous RCU batch is still
in progress. This is at best confusing.

Thus change this variable to "false" only when a new batch has been
successfully queued, otherwise, just leave it be.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# d434c00f 15-Apr-2021 Uladzislau Rezki (Sony) <urezki@gmail.com>

kvfree_rcu: Add a bulk-list check when a scheduler is run

The rcu_scheduler_active flag is set to RCU_SCHEDULER_RUNNING once the
scheduler is up and running. That signal is used in order to check
and queue a "monitor work" to reclaim freed objects (if there are any)
during early boot. This flag is used by kvfree_rcu() to determine when
work can safely be queued, at which point memory passed to earlier
invocations of kvfree_rcu() can be processed.

However, only "krcp->head" is checked for objects that need to be
released, and there are now two more, namely, "krcp->bkvhead[0]" and
"krcp->bkvhead[1]". Therefore, check these two additional channels.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# ac7625eb 15-Apr-2021 Uladzislau Rezki (Sony) <urezki@gmail.com>

kvfree_rcu: Use [READ/WRITE]_ONCE() macros to access to nr_bkv_objs

nr_bkv_objs is a count of the objects in the kvfree_rcu page cache.
Accessing it requires holding the ->lock. Switch to READ_ONCE() and
WRITE_ONCE() macros to provide lockless access to this counter.
This lockless access is used for the shrinker.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# d0bfa8b3 15-Apr-2021 Zhang Qiang <qiang.zhang@windriver.com>

kvfree_rcu: Release a page cache under memory pressure

Add a drain_page_cache() function to drain a per-cpu page cache.
The reason behind of it is a system can run into a low memory
condition, in that case a page shrinker can ask for its users
to free their caches in order to get extra memory available for
other needs in a system.

When a system hits such condition, a page cache is drained for
all CPUs in a system. By default a page cache work is delayed
with 5 seconds interval until a memory pressure disappears, if
needed it can be changed. See a rcu_delay_page_cache_fill_msec
module parameter.

Co-developed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# f39650de 30-Jun-2021 Andy Shevchenko <andriy.shevchenko@linux.intel.com>

kernel.h: split out panic and oops helpers

kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out panic and
oops helpers.

There are several purposes of doing this:
- dropping dependency in bug.h
- dropping a loop by moving out panic_notifier.h
- unload kernel.h from something which has its own domain

At the same time convert users tree-wide to use new headers, although for
the time being include new header back to kernel.h to avoid twisted
indirected includes for existing users.

[akpm@linux-foundation.org: thread_info.h needs limits.h]
[andriy.shevchenko@linux.intel.com: ia64 fix]
Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com

Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Co-developed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7abb18bd 25-Feb-2021 Paul E. McKenney <paulmck@kernel.org>

rcu: Provide polling interfaces for Tree RCU grace periods

There is a need for a non-blocking polling interface for RCU grace
periods, so this commit supplies start_poll_synchronize_rcu() and
poll_state_synchronize_rcu() for this purpose. Note that the existing
get_state_synchronize_rcu() may be used if future grace periods are
inevitable (perhaps due to a later call_rcu() invocation). The new
start_poll_synchronize_rcu() is to be used if future grace periods
might not otherwise happen. Finally, poll_state_synchronize_rcu()
provides a lockless check for a grace period having elapsed since
the corresponding call to either of the get_state_synchronize_rcu()
or start_poll_synchronize_rcu().

As with get_state_synchronize_rcu(), the return value from either
get_state_synchronize_rcu() or start_poll_synchronize_rcu() is passed in
to a later call to either poll_state_synchronize_rcu() or the existing
(might_sleep) cond_synchronize_rcu().

[ paulmck: Remove redundant smp_mb() per Frederic Weisbecker feedback. ]
[ Update poll_state_synchronize_rcu() docbook per Frederic Weisbecker feedback. ]
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# ec711bc1 28-Jan-2021 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Only (re-)initialize segcblist when needed on CPU up

At the start of a CPU-hotplug operation, the incoming CPU's callback
list can be in a number of states:

1. Disabled and empty. This is the case when the boot CPU has
not invoked call_rcu(), when a non-boot CPU first comes online,
and when a non-offloaded CPU comes back online. In this case,
it is both necessary and permissible to initialize ->cblist.
Because either the CPU is currently running with interrupts
disabled (boot CPU) or is not yet running at all (other CPUs),
it is not necessary to acquire ->nocb_lock.

In this case, initialization is required.

2. Disabled and non-empty. This cannot occur, because early boot
call_rcu() invocations enable the callback list before enqueuing
their callback.

3. Enabled, whether empty or not. In this case, the callback
list has already been initialized. This case occurs when the
boot CPU has executed an early boot call_rcu() and also when
an offloaded CPU comes back online. In both cases, there is
no need to initialize the callback list: In the boot-CPU case,
the CPU has not (yet) gone offline, and in the offloaded case,
the rcuo kthreads are taking care of business.

Because it is not necessary to initialize the callback list,
it is also not necessary to acquire ->nocb_lock.

Therefore, checking if the segcblist is enabled suffices. This commit
therefore initializes the callback list at rcutree_prepare_cpu() time
only if that list is disabled.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 64305db2 28-Jan-2021 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Forbid NOCB toggling on offline CPUs

It makes no sense to de-offload an offline CPU because that CPU will never
invoke any remaining callbacks. It also makes little sense to offload an
offline CPU because any pending RCU callbacks were migrated when that CPU
went offline. Yes, it is in theory possible to use a number of tricks
to permit offloading and deoffloading offline CPUs in certain cases, but
in practice it is far better to have the simple and deterministic rule
"Toggling the offload state of an offline CPU is forbidden".

For but one example, consider that an offloaded offline CPU might have
millions of callbacks queued. Best to just say "no".

This commit therefore forbids toggling of the offloaded state of
offline CPUs.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 3820b513 11-Nov-2020 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Detect unsafe checks for offloaded rdp

Provide CONFIG_PROVE_RCU sanity checks to ensure we are always reading
the offloaded state of an rdp in a safe and stable way and prevent from
its value to be changed under us. We must either hold the barrier mutex,
the cpu-hotplug lock (read or write) or the nocb lock.
Local non-preemptible reads are also safe. NOCB kthreads and timers have
their own means of synchronization against the offloaded state updaters.

Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# ee6ddf58 29-Jan-2021 Uladzislau Rezki (Sony) <urezki@gmail.com>

kvfree_rcu: Use same set of GFP flags as does single-argument

Running an rcuscale stress-suite can lead to "Out of memory" of a
system. This can happen under high memory pressure with a small amount
of physical memory.

For example, a KVM test configuration with 64 CPUs and 512 megabytes
can result in OOM when running rcuscale with below parameters:

../kvm.sh --torture rcuscale --allcpus --duration 10 --kconfig CONFIG_NR_CPUS=64 \
--bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 \
rcuscale.kfree_loops=10000 torture.disable_onoff_at_boot" --trust-make

<snip>
[ 12.054448] kworker/1:1H invoked oom-killer: gfp_mask=0x2cc0(GFP_KERNEL|__GFP_NOWARN), order=0, oom_score_adj=0
[ 12.055303] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ #510
[ 12.055416] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014
[ 12.056485] Workqueue: events_highpri fill_page_cache_func
[ 12.056485] Call Trace:
[ 12.056485] dump_stack+0x57/0x6a
[ 12.056485] dump_header+0x4c/0x30a
[ 12.056485] ? del_timer_sync+0x20/0x30
[ 12.056485] out_of_memory.cold.47+0xa/0x7e
[ 12.056485] __alloc_pages_slowpath.constprop.123+0x82f/0xc00
[ 12.056485] __alloc_pages_nodemask+0x289/0x2c0
[ 12.056485] __get_free_pages+0x8/0x30
[ 12.056485] fill_page_cache_func+0x39/0xb0
[ 12.056485] process_one_work+0x1ed/0x3b0
[ 12.056485] ? process_one_work+0x3b0/0x3b0
[ 12.060485] worker_thread+0x28/0x3c0
[ 12.060485] ? process_one_work+0x3b0/0x3b0
[ 12.060485] kthread+0x138/0x160
[ 12.060485] ? kthread_park+0x80/0x80
[ 12.060485] ret_from_fork+0x22/0x30
[ 12.062156] Mem-Info:
[ 12.062350] active_anon:0 inactive_anon:0 isolated_anon:0
[ 12.062350] active_file:0 inactive_file:0 isolated_file:0
[ 12.062350] unevictable:0 dirty:0 writeback:0
[ 12.062350] slab_reclaimable:2797 slab_unreclaimable:80920
[ 12.062350] mapped:1 shmem:2 pagetables:8 bounce:0
[ 12.062350] free:10488 free_pcp:1227 free_cma:0
...
[ 12.101610] Out of memory and no killable processes...
[ 12.102042] Kernel panic - not syncing: System is deadlocked on memory
[ 12.102583] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ #510
[ 12.102600] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014
<snip>

Because kvfree_rcu() has a fallback path, memory allocation failure is
not the end of the world. Furthermore, the added overhead of aggressive
GFP settings must be balanced against the overhead of the fallback path,
which is a cache miss for double-argument kvfree_rcu() and a call to
synchronize_rcu() for single-argument kvfree_rcu(). The current choice
of GFP_KERNEL|__GFP_NOWARN can result in longer latencies than a call
to synchronize_rcu(), so less-tenacious GFP flags would be helpful.

Here is the tradeoff that must be balanced:
a) Minimize use of the fallback path,
b) Avoid pushing the system into OOM,
c) Bound allocation latency to that of synchronize_rcu(), and
d) Leave the emergency reserves to use cases lacking fallbacks.

This commit therefore changes GFP flags from GFP_KERNEL|__GFP_NOWARN to
GFP_KERNEL|__GFP_NORETRY|__GFP_NOMEMALLOC|__GFP_NOWARN. This combination
leaves the emergency reserves alone and can initiate reclaim, but will
not invoke the OOM killer.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 3e7ce7a1 29-Jan-2021 Uladzislau Rezki (Sony) <urezki@gmail.com>

kvfree_rcu: Replace __GFP_RETRY_MAYFAIL by __GFP_NORETRY

__GFP_RETRY_MAYFAIL can spend quite a bit of time reclaiming, and this
can be wasted effort given that there is a fallback code path in case
memory allocation fails.

__GFP_NORETRY does perform some light-weight reclaim, but it will fail
under OOM conditions, allowing the fallback to be taken as an alternative
to hard-OOMing the system.

There is a four-way tradeoff that must be balanced:
1) Minimize use of the fallback path;
2) Avoid full-up OOM;
3) Do a light-wait allocation request;
4) Avoid dipping into the emergency reserves.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 7ffc9ec8 20-Jan-2021 Paul E. McKenney <paulmck@kernel.org>

kvfree_rcu: Make krc_this_cpu_unlock() use raw_spin_unlock_irqrestore()

The krc_this_cpu_unlock() function does a raw_spin_unlock() immediately
followed by a local_irq_restore(). This commit saves a line of code by
merging them into a raw_spin_unlock_irqrestore(). This transformation
also reduces scheduling latency because raw_spin_unlock_irqrestore()
responds immediately to a reschedule request. In contrast,
local_irq_restore() does a scheduling-oblivious enabling of interrupts.

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# b01b4050 20-Jan-2021 Paul E. McKenney <paulmck@kernel.org>

kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu()

This commit applies the __GFP_NOMEMALLOC gfp flag to memory allocations
carried out by the single-argument variant of kvfree_rcu(), thus avoiding
this can-sleep code path from dipping into the emergency reserves.

Acked-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 148e3731 20-Jan-2021 Uladzislau Rezki (Sony) <urezki@gmail.com>

kvfree_rcu: Directly allocate page for single-argument case

Single-argument kvfree_rcu() must be invoked from sleepable contexts,
so we can directly allocate pages. Furthermmore, the fallback in case
of page-allocation failure is the high-latency synchronize_rcu(), so it
makes sense to do these page allocations from the fastpath, and even to
permit limited sleeping within the allocator.

This commit therefore allocates if needed on the fastpath using
GFP_KERNEL|__GFP_RETRY_MAYFAIL. This also has the beneficial effect
of leaving kvfree_rcu()'s per-CPU caches to the double-argument variant
of kvfree_rcu(), given that the double-argument variant cannot directly
invoke the allocator.

[ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 6494ccb9 10-Jan-2021 Zhouyi Zhou <zhouzhouyi@gmail.com>

rcu: Remove spurious instrumentation_end() in rcu_nmi_enter()

In rcu_nmi_enter(), there is an erroneous instrumentation_end() in the
second branch of the "if" statement. Oddly enough, "objtool check -f
vmlinux.o" fails to complain because it is unable to correctly cover
all cases. Instead, objtool visits the third branch first, which marks
following trace_rcu_dyntick() as visited. This commit therefore removes
the spurious instrumentation_end().

Fixes: 04b25a495bd6 ("rcu: Mark rcu_nmi_enter() call to rcu_cleanup_after_idle() noinstr")
Reported-by Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 47fcbc8d 11-Jan-2021 Neeraj Upadhyay <neeraju@codeaurora.org>

rcu: Fix CPU-offline trace in rcutree_dying_cpu

The condition in the trace_rcu_grace_period() in rcutree_dying_cpu() is
backwards, so that it uses the string "cpuofl" when the offline CPU is
blocking the current grace period and "cpuofl-bgp" otherwise. Given that
the "-bgp" stands for "blocking grace period", this is at best misleading.
This commit therefore switches these strings in order to correctly trace
whether the outgoing cpu blocks the current grace period.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# d3ad5bbc 06-Jan-2021 Frederic Weisbecker <frederic@kernel.org>

rcu: Remove superfluous rdp fetch

Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar<mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 4ae7dc97 31-Jan-2021 Frederic Weisbecker <frederic@kernel.org>

entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point

Following the idle loop model, cleanly check for pending rcuog wakeup
before the last rescheduling point upon resuming to guest mode. This
way we can avoid to do it from rcu_user_enter() with the last resort
self-IPI hack that enforces rescheduling.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-6-frederic@kernel.org


# 47b8ff19 31-Jan-2021 Frederic Weisbecker <frederic@kernel.org>

entry: Explicitly flush pending rcuog wakeup before last rescheduling point

Following the idle loop model, cleanly check for pending rcuog wakeup
before the last rescheduling point on resuming to user mode. This
way we can avoid to do it from rcu_user_enter() with the last resort
self-IPI hack that enforces rescheduling.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-5-frederic@kernel.org


# f8bb5cae 31-Jan-2021 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Trigger self-IPI on late deferred wake up before user resume

Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
kthread (rcuog) to be serviced.

Unfortunately the call to rcu_user_enter() is already past the last
rescheduling opportunity before we resume to userspace or to guest mode.
We may escape there with the woken task ignored.

The ultimate resort to fix every callsites is to trigger a self-IPI
(nohz_full depends on arch to implement arch_irq_work_raise()) that will
trigger a reschedule on IRQ tail or guest exit.

Eventually every site that want a saner treatment will need to carefully
place a call to rcu_nocb_flush_deferred_wakeup() before the last explicit
need_resched() check upon resume.

Fixes: 96d3fd0d315a (rcu: Break call_rcu() deadlock involving scheduler and perf)
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-4-frederic@kernel.org


# 43789ef3 31-Jan-2021 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Perform deferred wake up before last idle's need_resched() check

Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
kthread (rcuog) to be serviced.

Usually a local wake up happening while running the idle task is handled
in one of the need_resched() checks carefully placed within the idle
loop that can break to the scheduler.

Unfortunately the call to rcu_idle_enter() is already beyond the last
generic need_resched() check and we may halt the CPU with a resched
request unhandled, leaving the task hanging.

Fix this with splitting the rcuog wakeup handling from rcu_idle_enter()
and place it before the last generic need_resched() check in the idle
loop. It is then assumed that no call to call_rcu() will be performed
after that in the idle loop until the CPU is put in low power mode.

Fixes: 96d3fd0d315a (rcu: Break call_rcu() deadlock involving scheduler and perf)
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-3-frederic@kernel.org


# 54b7429e 31-Jan-2021 Frederic Weisbecker <frederic@kernel.org>

rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers

Deferred wakeup of rcuog kthreads upon RCU idle mode entry is going to
be handled differently whether initiated by idle, user or guest. Prepare
with pulling that control up to rcu_eqs_enter() callers.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-2-frederic@kernel.org


# b4b7914a 08-Dec-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Make call_rcu() print mem_dump_obj() info for double-freed callback

The debug-object double-free checks in __call_rcu() print out the
RCU callback function, which is usually sufficient to track down the
double free. However, all uses of things like queue_rcu_work() will
have the same RCU callback function (rcu_work_rcufn() in this case),
so a diagnostic message for a double queue_rcu_work() needs more than
just the callback function.

This commit therefore calls mem_dump_obj() to dump out any additional
available information on the double-freed callback.

Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 683954e5 16-Nov-2020 Neeraj Upadhyay <neeraju@codeaurora.org>

rcu: Check and report missed fqs timer wakeup on RCU stall

For a new grace period request, the RCU GP kthread transitions through
following states:

a. [RCU_GP_WAIT_GPS] -> [RCU_GP_DONE_GPS]

The RCU_GP_WAIT_GPS state is where the GP kthread waits for a request
for a new GP. Once it receives a request (for example, when a new RCU
callback is queued), the GP kthread transitions to RCU_GP_DONE_GPS.

b. [RCU_GP_DONE_GPS] -> [RCU_GP_ONOFF]

Grace period initialization starts in rcu_gp_init(), which records the
start of new GP in rcu_state.gp_seq and transitions to RCU_GP_ONOFF.

c. [RCU_GP_ONOFF] -> [RCU_GP_INIT]

The purpose of the RCU_GP_ONOFF state is to apply the online/offline
information that was buffered for any CPUs that recently came online or
went offline. This state is maintained in per-leaf rcu_node bitmasks,
with the buffered state in ->qsmaskinitnext and the state for the upcoming
GP in ->qsmaskinit. At the end of this RCU_GP_ONOFF state, each bit in
->qsmaskinit will correspond to a CPU that must pass through a quiescent
state before the upcoming grace period is allowed to complete.

However, a leaf rcu_node structure with an all-zeroes ->qsmaskinit
cannot necessarily be ignored. In preemptible RCU, there might well be
tasks still in RCU read-side critical sections that were first preempted
while running on one of the CPUs managed by this structure. Such tasks
will be queued on this structure's ->blkd_tasks list. Only after this
list fully drains can this leaf rcu_node structure be ignored, and even
then only if none of its CPUs have come back online in the meantime.
Once that happens, the ->qsmaskinit masks further up the tree will be
updated to exclude this leaf rcu_node structure.

Once the ->qsmaskinitnext and ->qsmaskinit fields have been updated
as needed, the GP kthread transitions to RCU_GP_INIT.

d. [RCU_GP_INIT] -> [RCU_GP_WAIT_FQS]

The purpose of the RCU_GP_INIT state is to copy each ->qsmaskinit to
the ->qsmask field within each rcu_node structure. This copying is done
breadth-first from the root to the leaves. Why not just copy directly
from ->qsmaskinitnext to ->qsmask? Because the ->qsmaskinitnext masks
can change in the meantime as additional CPUs come online or go offline.
Such changes would result in inconsistencies in the ->qsmask fields up and
down the tree, which could in turn result in too-short grace periods or
grace-period hangs. These issues are avoided by snapshotting the leaf
rcu_node structures' ->qsmaskinitnext fields into their ->qsmaskinit
counterparts, generating a consistent set of ->qsmaskinit fields
throughout the tree, and only then copying these consistent ->qsmaskinit
fields to their ->qsmask counterparts.

Once this initialization step is complete, the GP kthread transitions
to RCU_GP_WAIT_FQS, where it waits to do a force-quiescent-state scan
on the one hand or for the end of the grace period on the other.

e. [RCU_GP_WAIT_FQS] -> [RCU_GP_DOING_FQS]

The RCU_GP_WAIT_FQS state waits for one of three things: (1) An
explicit request to do a force-quiescent-state scan, (2) The end of
the grace period, or (3) A short interval of time, after which it
will do a force-quiescent-state (FQS) scan. The explicit request can
come from rcutorture or from any CPU that has too many RCU callbacks
queued (see the qhimark kernel parameter and the RCU_GP_FLAG_OVLD
flag). The aforementioned "short period of time" is specified by the
jiffies_till_first_fqs boot parameter for a given grace period's first
FQS scan and by the jiffies_till_next_fqs for later FQS scans.

Either way, once the wait is over, the GP kthread transitions to
RCU_GP_DOING_FQS.

f. [RCU_GP_DOING_FQS] -> [RCU_GP_CLEANUP]

The RCU_GP_DOING_FQS state performs an FQS scan. Each such scan carries
out two functions for any CPU whose bit is still set in its leaf rcu_node
structure's ->qsmask field, that is, for any CPU that has not yet reported
a quiescent state for the current grace period:

i. Report quiescent states on behalf of CPUs that have been observed
to be idle (from an RCU perspective) since the beginning of the
grace period.

ii. If the current grace period is too old, take various actions to
encourage holdout CPUs to pass through quiescent states, including
enlisting the aid of any calls to cond_resched() and might_sleep(),
and even including IPIing the holdout CPUs.

These checks are skipped for any leaf rcu_node structure with a all-zero
->qsmask field, however such structures are subject to RCU priority
boosting if there are tasks on a given structure blocking the current
grace period. The end of the grace period is detected when the root
rcu_node structure's ->qsmask is zero and when there are no longer any
preempted tasks blocking the current grace period. (No, this last check
is not redundant. To see this, consider an rcu_node tree having exactly
one structure that serves as both root and leaf.)

Once the end of the grace period is detected, the GP kthread transitions
to RCU_GP_CLEANUP.

g. [RCU_GP_CLEANUP] -> [RCU_GP_CLEANED]

The RCU_GP_CLEANUP state marks the end of grace period by updating the
rcu_state structure's ->gp_seq field and also all rcu_node structures'
->gp_seq field. As before, the rcu_node tree is traversed in breadth
first order. Once this update is complete, the GP kthread transitions
to the RCU_GP_CLEANED state.

i. [RCU_GP_CLEANED] -> [RCU_GP_INIT]

Once in the RCU_GP_CLEANED state, the GP kthread immediately transitions
into the RCU_GP_INIT state.

j. The role of timers.

If there is at least one idle CPU, and if timers are not firing, the
transition from RCU_GP_DOING_FQS to RCU_GP_CLEANUP will never happen.
Timers can fail to fire for a number of reasons, including issues in
timer configuration, issues in the timer framework, and failure to handle
softirqs (for example, when there is a storm of interrupts). Whatever the
reason, if the timers fail to fire, the GP kthread will never be awakened,
resulting in RCU CPU stall warnings and eventually in OOM.

However, an RCU CPU stall warning has a large number of potential causes,
as documented in Documentation/RCU/stallwarn.rst. This commit therefore
adds analysis to the RCU CPU stall-warning code to emit an additional
message if the cause of the stall is likely to be timer failure.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 147c6852 22-Dec-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Do any deferred nocb wakeups at CPU offline time

Because the need to wake a nocb GP kthread ("rcuog") is sometimes
detected when wakeups cannot be done, these wakeups can be deferred.
The wakeups are then carried out by calls to do_nocb_deferred_wakeup()
at various safe points in the code, including RCU's idle hooks. However,
when a CPU goes offline, it invokes arch_cpu_idle_dead() without invoking
any of RCU's idle hooks.

This commit therefore adds a call to do_nocb_deferred_wakeup() in
rcu_report_dead() in order to handle any deferred wakeups that have been
requested by the outgoing CPU.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 634954c2 13-Nov-2020 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Locally accelerate callbacks as long as offloading isn't complete

The local callbacks processing checks if any callbacks need acceleration.
This commit carries out this checking under nocb lock protection in
the middle of toggle operations, during which time rcu_core() executes
concurrently with GP/CB kthreads.

Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Inspired-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 32aa2f41 13-Nov-2020 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Process batch locally as long as offloading isn't complete

This commit makes sure to process the callbacks locally (via either
RCU_SOFTIRQ or the rcuc kthread) whenever the segcblist isn't entirely
offloaded. This ensures that callbacks are invoked one way or another
while a CPU is in the middle of a toggle operation.

Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Inspired-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# e3abe959 13-Nov-2020 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Only cond_resched() from actual offloaded batch processing

During a toggle operations, rcu_do_batch() may be invoked concurrently
by softirqs and offloaded processing for a given CPU's callbacks.
This commit therefore makes sure cond_resched() is invoked only from
the offloaded context.

Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Inspired-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 126d9d49 13-Nov-2020 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Always init segcblist on CPU up

How the rdp->cblist enabled state is treated at CPU-hotplug time depends
on whether or not that ->cblist is offloaded.

1) Not offloaded: The ->cblist is disabled when the CPU goes down. All
its callbacks are migrated and none can to enqueued until after some
later CPU-hotplug operation brings the CPU back up.

2) Offloaded: The ->cblist is not disabled on CPU down because the CB/GP
kthreads must finish invoking the remaining callbacks. There is thus
no need to re-enable it on CPU up.

Since the ->cblist offloaded state is set in stone at boot, it cannot
change between CPU down and CPU up. So 1) and 2) are symmetrical.

However, given runtime toggling of the offloaded state, there are two
additional asymmetrical scenarios:

3) The ->cblist is not offloaded when the CPU goes down. The ->cblist
is later toggled to offloaded and then the CPU comes back up.

4) The ->cblist is offloaded when the CPU goes down. The ->cblist is
later toggled to no longer be offloaded and then the CPU comes back up.

Scenario 4) is currently handled correctly. The ->cblist remains enabled
on CPU down and gets re-initialized on CPU up. The toggling operation
will wait until ->cblist is empty, so ->cblist will remain empty until
CPU-up time.

The scenario 3) would run into trouble though, as the rdp is disabled
on CPU down and not re-initialized/re-enabled on CPU up. Except that
in this case, ->cblist is guaranteed to be empty because all its
callbacks were migrated away at CPU-down time. And the CPU-up code
already initializes and enables any empty ->cblist structures in order
to handle the possibility of early-boot invocations of call_rcu() in
the case where such invocations don't occur. So all that need be done
is to adjust the locking.

Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Inspired-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 8d346d43 13-Nov-2020 Frederic Weisbecker <frederic@kernel.org>

rcu/nocb: Provide basic callback offloading state machine bits

Offloading and de-offloading RCU callback processes must be done
carefully. There must never be a time at which callback processing is
disabled because the task driving the offloading or de-offloading might be
preempted or otherwise stalled at that point in time, which would result
in OOM due to calbacks piling up indefinitely. This implies that there
will be times during which a given CPU's callbacks might be concurrently
invoked by both that CPU's RCU_SOFTIRQ handler (or, equivalently, that
CPU's rcuc kthread) and by that CPU's rcuo kthread.

This situation could fatally confuse both rcu_barrier() and the
CPU-hotplug offlining process, so these must be excluded during any
concurrent-callback-invocation period. In addition, during times of
concurrent callback invocation, changes to ->cblist must be protected
both as needed for RCU_SOFTIRQ and as needed for the rcuo kthread.

This commit therefore defines and documents the states for a state
machine that coordinates offloading and deoffloading.

Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Inspired-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# b4e6039e 18-Nov-2020 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/segcblist: Add debug checks for segment lengths

This commit adds debug checks near the end of rcu_do_batch() that emit
warnings if an empty rcu_segcblist structure has non-zero segment counts,
or, conversely, if a non-empty structure has all-zero segment counts.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Fix queue/segment-length checks. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 3afe7fa5 14-Nov-2020 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/trace: Add tracing for how segcb list changes

This commit adds tracing to track how the segcb list changes before/after
acceleration, during queuing and during dequeuing.

This tracing helped discover an optimization that avoided needless GP
requests when no callbacks were accelerated. The tracing overhead is
minimal as each segment's length is now stored in the respective segment.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 68804cf1 14-Oct-2020 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/tree: segcblist: Remove redundant smp_mb()s

The full memory barriers in rcu_segcblist_enqueue() and in rcu_do_batch()
are not needed because rcu_segcblist_add_len(), and thus also
rcu_segcblist_inc_len(), already includes a memory barrier *before*
and *after* the length of the list is updated.

This commit therefore removes these redundant smp_mb() invocations.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a649d25d 19-Nov-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees

This commit adds a number of lockdep_assert_irqs_disabled() calls
to rcu_sched_clock_irq() and a number of the functions that it calls.
The point of this is to help track down a situation where lockdep appears
to be insisting that interrupts are enabled within these functions, which
should only ever be invoked from the scheduling-clock interrupt handler.

Link: https://lore.kernel.org/lkml/20201111133813.GA81547@elver.google.com/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 8b9a0ecc 15-Dec-2020 Scott Wood <swood@redhat.com>

rcu: Unconditionally use rcuc threads on PREEMPT_RT

PREEMPT_RT systems have long used the rcutree.use_softirq kernel
boot parameter to avoid use of RCU_SOFTIRQ handlers, which can disrupt
real-time applications by invoking callbacks during return from interrupts
that arrived while executing time-critical code. This kernel boot
parameter instead runs RCU core processing in an 'rcuc' kthread, thus
allowing the scheduler to do its job of avoiding disrupting time-critical
code.

This commit therefore disables the rcutree.use_softirq kernel boot
parameter on PREEMPT_RT systems, thus forcing such systems to do RCU
core processing in 'rcuc' kthreads. This approach has long been in
use by users of the -rt patchset, and there have been no complaints.
There is therefore no way for the system administrator to override this
choice, at least without modifying and rebuilding the kernel.

Signed-off-by: Scott Wood <swood@redhat.com>
[bigeasy: Reword commit message]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[ paulmck: Update kernel-parameters.txt accordingly. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 84109ab5 20-Nov-2020 Zqiang <qiang.zhang@windriver.com>

rcu: Record kvfree_call_rcu() call stack for KASAN

This commit adds a call to kasan_record_aux_stack() in kvfree_call_rcu()
in order to record the call stack of the code that caused the object
to be freed. Please note that this function does not update the
allocated/freed state, which is important because RCU readers might
still be referencing this object.

Acked-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 6bc33582 03-Nov-2020 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/tree: Make rcu_do_batch count how many callbacks were executed

The rcu_do_batch() function extracts the ready-to-invoke callbacks
from the rcu_segcblist located in the ->cblist field of the current
CPU's rcu_data structure. These callbacks are first moved to a local
(unsegmented) rcu_cblist. The rcu_do_batch() function then uses this
rcu_cblist's ->len field to count how many CBs it has invoked, but it
does so by counting that field down from zero. Finally, this function
negates the value in this ->len field (resulting in a positive number)
and subtracts the result from the ->len field of the current CPU's
->cblist field.

Except that it is sometimes necessary for rcu_do_batch() to stop invoking
callbacks mid-stream, despite there being more ready to invoke, for
example, if a high-priority task wakes up. In this case the remaining
not-yet-invoked callbacks are requeued back onto the CPU's ->cblist,
but remain in the ready-to-invoke segment of that list. As above, the
negative of the local rcu_cblist's ->len field is still subtracted from
the ->len field of the current CPU's ->cblist field.

The design of counting down from 0 is confusing and error-prone, plus
use of a positive count will make it easier to provide a uniform and
consistent API to deal with the per-segment counts that are added
later in this series. For example, rcu_segcblist_extract_done_cbs()
can unconditionally populate the resulting unsegmented list's ->len
field during extraction.

This commit therefore explicitly counts how many callbacks were executed
in rcu_do_batch() itself, counting up from zero, and then uses that
to update the per-CPU segcb list's ->len field, without relying on the
downcounting of rcl->len from zero.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 7a9f50a0 15-Jun-2020 Peter Zijlstra <peterz@infradead.org>

irq_work: Cleanup

Get rid of the __call_single_node union and clean up the API a little
to avoid external code relying on the structure layout as much.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>


# 56292e86 29-Oct-2020 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/tree: Defer kvfree_rcu() allocation to a clean context

The current memmory-allocation interface causes the following difficulties
for kvfree_rcu():

a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will
complain about violation of the nesting rules, as in "BUG: Invalid
wait context". This Kconfig option checks for proper raw_spinlock
vs. spinlock nesting, in particular, it is not legal to acquire a
spinlock_t while holding a raw_spinlock_t.

This is a problem because kfree_rcu() uses raw_spinlock_t whereas the
"page allocator" internally deals with spinlock_t to access to its
zones. The code also can be broken from higher level of view:
<snip>
raw_spin_lock(&some_lock);
kfree_rcu(some_pointer, some_field_offset);
<snip>

b) If built with CONFIG_PREEMPT_RT, spinlock_t is converted into
sleeplock. This means that invoking the page allocator from atomic
contexts results in "BUG: scheduling while atomic".

c) Please note that call_rcu() is already invoked from raw atomic context,
so it is only reasonable to expaect that kfree_rcu() and kvfree_rcu()
will also be called from atomic raw context.

This commit therefore defers page allocation to a clean context using the
combination of an hrtimer and a workqueue. The hrtimer stage is required
in order to avoid deadlocks with the scheduler. This deferred allocation
is required only when kvfree_rcu()'s per-CPU page cache is empty.

Link: https://lore.kernel.org/lkml/20200630164543.4mdcf6zb4zfclhln@linutronix.de/
Fixes: 3042f83f19be ("rcu: Support reclaim for head-less object")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 354c3f0e 14-Oct-2020 Zhouyi Zhou <zhouzhouyi@gmail.com>

rcu: Fix a typo in rcu_blocking_is_gp() header comment

This commit fixes a typo in the rcu_blocking_is_gp() function's header
comment.

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 4d60b475 13-Oct-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Prevent lockdep-RCU splats on lock acquisition/release

The rcu_cpu_starting() and rcu_report_dead() functions transition the
current CPU between online and offline state from an RCU perspective.
Unfortunately, this means that the rcu_cpu_starting() function's lock
acquisition and the rcu_report_dead() function's lock releases happen
while the CPU is offline from an RCU perspective, which can result
in lockdep-RCU splats about using RCU from an offline CPU. And this
situation can also result in too-short grace periods, especially in
guest OSes that are subject to vCPU preemption.

This commit therefore uses sequence-count-like synchronization to forgive
use of RCU while RCU thinks a CPU is offline across the full extent of
the rcu_cpu_starting() and rcu_report_dead() function's lock acquisitions
and releases.

One approach would have been to use the actual sequence-count primitives
provided by the Linux kernel. Unfortunately, the resulting code looks
completely broken and wrong, and is likely to result in patches that
break RCU in an attempt to address this appearance of broken wrongness.
Plus there is no net savings in lines of code, given the additional
explicit memory barriers required.

Therefore, this sequence count is instead implemented by a new ->ofl_seq
field in the rcu_node structure. If this counter's value is an odd
number, RCU forgives RCU read-side critical sections on other CPUs covered
by the same rcu_node structure, even if those CPUs are offline from
an RCU perspective. In addition, if a given leaf rcu_node structure's
->ofl_seq counter value is an odd number, rcu_gp_init() delays starting
the grace period until that counter value changes.

[ paulmck: Apply Peter Zijlstra feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# bd56e0a4 07-Oct-2020 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/tree: nocb: Avoid raising softirq for offloaded ready-to-execute CBs

Testing showed that rcu_pending() can return 1 when offloaded callbacks
are ready to execute. This invokes RCU core processing, for example,
by raising RCU_SOFTIRQ, eventually resulting in a call to rcu_core().
However, rcu_core() explicitly avoids in any way manipulating offloaded
callbacks, which are instead handled by the rcuog and rcuoc kthreads,
which work independently of rcu_core().

One exception to this independence is that rcu_core() invokes
do_nocb_deferred_wakeup(), however, rcu_pending() also checks
rcu_nocb_need_deferred_wakeup() in order to correctly handle this case,
invoking rcu_core() when needed.

This commit therefore avoids needlessly invoking RCU core processing
by checking rcu_segcblist_ready_cbs() only on non-offloaded CPUs.
This reduces overhead, for example, by reducing softirq activity.

This change passed 30 minute tests of TREE01 through TREE09 each.

On TREE08, there is at most 150us from the time that rcu_pending() chose
not to invoke RCU core processing to the time when the ready callbacks
were invoked by the rcuoc kthread. This provides further evidence that
there is no need to invoke rcu_core() for offloaded callbacks that are
ready to invoke.

Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# d2098b44 29-Sep-2020 Peter Zijlstra <peterz@infradead.org>

rcu,ftrace: Fix ftrace recursion

Kim reported that perf-ftrace made his box unhappy. It turns out that
commit:

ff5c4f5cad33 ("rcu/tree: Mark the idle relevant functions noinstr")

removed one too many notrace qualifiers, probably due to there not being
a helpful comment.

This commit therefore reinstates the notrace and adds a comment to avoid
losing it again.

[ paulmck: Apply Steven Rostedt's feedback on the comment. ]
Fixes: ff5c4f5cad33 ("rcu/tree: Mark the idle relevant functions noinstr")
Reported-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 7c47ee5a 03-Oct-2020 Joe Perches <joe@perches.com>

rcu/tree: Make struct kernel_param_ops definitions const

These should be const, so make it so.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 9f866dac 29-Sep-2020 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/tree: Add a warning if CPU being onlined did not report QS already

Currently, rcu_cpu_starting() checks to see if the RCU core expects a
quiescent state from the incoming CPU. However, the current interaction
between RCU quiescent-state reporting and CPU-hotplug operations should
mean that the incoming CPU never needs to report a quiescent state.
First, the outgoing CPU reports a quiescent state if needed. Second,
the race where the CPU is leaving just as RCU is initializing a new
grace period is handled by an explicit check for this condition. Third,
the CPU's leaf rcu_node structure's ->lock serializes these checks.

This means that if rcu_cpu_starting() ever feels the need to report
a quiescent state, then there is a bug somewhere in the CPU hotplug
code or the RCU grace-period handling code. This commit therefore
adds a WARN_ON_ONCE() to bring that bug to everyone's attention.

Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# ed73860c 22-Sep-2020 Neeraj Upadhyay <neeraju@codeaurora.org>

rcu: Fix single-CPU check in rcu_blocking_is_gp()

Currently, for CONFIG_PREEMPTION=n kernels, rcu_blocking_is_gp() uses
num_online_cpus() to determine whether there is only one CPU online. When
there is only a single CPU online, the simple fact that synchronize_rcu()
could be legally called implies that a full grace period has elapsed.
Therefore, in the single-CPU case, synchronize_rcu() simply returns
immediately. Unfortunately, num_online_cpus() is unreliable while a
CPU-hotplug operation is transitioning to or from single-CPU operation
because:

1. num_online_cpus() uses atomic_read(&__num_online_cpus) to
locklessly sample the number of online CPUs. The hotplug locks
are not held, which means that an incoming CPU can concurrently
update this count. This in turn means that an RCU read-side
critical section on the incoming CPU might observe updates
prior to the grace period, but also that this critical section
might extend beyond the end of the optimized synchronize_rcu().
This breaks RCU's fundamental guarantee.

2. In addition, num_online_cpus() does no ordering, thus providing
another way that RCU's fundamental guarantee can be broken by
the current code.

3. The most probable failure mode happens on outgoing CPUs.
The outgoing CPU updates the count of online CPUs in the
CPUHP_TEARDOWN_CPU stop-machine handler, which is fine in
and of itself due to preemption being disabled at the call
to num_online_cpus(). Unfortunately, after that stop-machine
handler returns, the CPU takes one last trip through the
scheduler (which has RCU readers) and, after the resulting
context switch, one final dive into the idle loop. During this
time, RCU needs to keep track of two CPUs, but num_online_cpus()
will say that there is only one, which in turn means that the
surviving CPU will incorrectly ignore the outgoing CPU's RCU
read-side critical sections.

This problem is illustrated by the following litmus test in which P0()
corresponds to synchronize_rcu() and P1() corresponds to the incoming CPU.
The herd7 tool confirms that the "exists" clause can be satisfied,
thus demonstrating that this breakage can happen according to the Linux
kernel memory model.

{
int x = 0;
atomic_t numonline = ATOMIC_INIT(1);
}

P0(int *x, atomic_t *numonline)
{
int r0;
WRITE_ONCE(*x, 1);
r0 = atomic_read(numonline);
if (r0 == 1) {
smp_mb();
} else {
synchronize_rcu();
}
WRITE_ONCE(*x, 2);
}

P1(int *x, atomic_t *numonline)
{
int r0; int r1;

atomic_inc(numonline);
smp_mb();
rcu_read_lock();
r0 = READ_ONCE(*x);
smp_rmb();
r1 = READ_ONCE(*x);
rcu_read_unlock();
}

locations [x;numonline;]

exists (1:r0=0 /\ 1:r1=2)

It is important to note that these problems arise only when the system
is transitioning to or from single-CPU operation.

One solution would be to hold the CPU-hotplug locks while sampling
num_online_cpus(), which was in fact the intent of the (redundant)
preempt_disable() and preempt_enable() surrounding this call to
num_online_cpus(). Actually blocking CPU hotplug would not only result
in excessive overhead, but would also unnecessarily impede CPU-hotplug
operations.

This commit therefore follows long-standing RCU tradition by maintaining
a separate RCU-specific set of CPU-hotplug books.

This separate set of books is implemented by a new ->n_online_cpus field
in the rcu_state structure that maintains RCU's count of the online CPUs.
This count is incremented early in the CPU-online process, so that
the critical transition away from single-CPU operation will occur when
there is only a single CPU. Similarly for the critical transition to
single-CPU operation, the counter is decremented late in the CPU-offline
process, again while there is only a single CPU. Because there is only
ever a single CPU when the ->n_online_cpus field undergoes the critical
1->2 and 2->1 transitions, full memory ordering and mutual exclusion is
provided implicitly and, better yet, for free.

In the case where the CPU is coming online, nothing will happen until
the current CPU helps it come online. Therefore, the new CPU will see
all accesses prior to the optimized grace period, which means that RCU
does not need to further delay this new CPU. In the case where the CPU
is going offline, the outgoing CPU is totally out of the picture before
the optimized grace period starts, which means that this outgoing CPU
cannot see any of the accesses following that grace period. Again,
RCU needs no further interaction with the outgoing CPU.

This does mean that synchronize_rcu() will unnecessarily do a few grace
periods the hard way just before the second CPU comes online and just
after the second-to-last CPU goes offline, but it is not worth optimizing
this uncommon case.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# e3771c85 21-Sep-2020 Frederic Weisbecker <frederic@kernel.org>

rcu: Implement rcu_segcblist_is_offloaded() config dependent

This commit simplifies the use of the rcu_segcblist_is_offloaded() API so
that its callers no longer need to check the RCU_NOCB_CPU Kconfig option.
Note that rcu_segcblist_is_offloaded() is defined in the header file,
which means that the generated code should be just as efficient as before.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 6dbce04d 16-Nov-2020 Peter Zijlstra <peterz@infradead.org>

rcu: Allow rcu_irq_enter_check_tick() from NMI

Eugenio managed to tickle #PF from NMI context which resulted in
hitting a WARN in RCU through irqentry_enter() ->
__rcu_irq_enter_check_tick().

However, this situation is perfectly sane and does not warrant an
WARN. The #PF will (necessarily) be atomic and not require messing
with the tick state, so early return is correct. This commit
therefore removes the WARN.

Fixes: aaf2bc50df1f ("rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()")
Reported-by: "Eugenio Pérez" <eupm90@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 04e613de 06-Nov-2020 Will Deacon <will@kernel.org>

arm64: smp: Tell RCU about CPUs that fail to come online

Commit ce3d31ad3cac ("arm64/smp: Move rcu_cpu_starting() earlier") ensured
that RCU is informed early about incoming CPUs that might end up calling
into printk() before they are online. However, if such a CPU fails the
early CPU feature compatibility checks in check_local_cpu_capabilities(),
then it will be powered off or parked without informing RCU, leading to
an endless stream of stalls:

| rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
| rcu: 2-O...: (0 ticks this GP) idle=002/1/0x4000000000000000 softirq=0/0 fqs=2593
| (detected by 0, t=5252 jiffies, g=9317, q=136)
| Task dump for CPU 2:
| task:swapper/2 state:R running task stack: 0 pid: 0 ppid: 1 flags:0x00000028
| Call trace:
| ret_from_fork+0x0/0x30

Ensure that the dying CPU invokes rcu_report_dead() prior to being powered
off or parked.

Cc: Qian Cai <cai@redhat.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Suggested-by: Qian Cai <cai@redhat.com>
Link: https://lore.kernel.org/r/20201105222242.GA8842@willie-the-truck
Link: https://lore.kernel.org/r/20201106103602.9849-3-will@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>


# 3fcd6a23 03-Sep-2020 Paul E. McKenney <paulmck@kernel.org>

x86/cpu: Avoid cpuinfo-induced IPIing of idle CPUs

Currently, accessing /proc/cpuinfo sends IPIs to idle CPUs in order to
learn their clock frequency. Which is a bit strange, given that waking
them from idle likely significantly changes their clock frequency.
This commit therefore avoids sending /proc/cpuinfo-induced IPIs to
idle CPUs.

[ paulmck: Also check for idle in arch_freq_prepare_all(). ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>


# 4230e2de 21-Oct-2020 Zong Li <zong.li@sifive.com>

stop_machine, rcu: Mark functions as notrace

Some architectures assume that the stopped CPUs don't make function calls
to traceable functions when they are in the stopped state. See also commit
cb9d7fd51d9f ("watchdog: Mark watchdog touch functions as notrace").

Violating this assumption causes kernel crashes when switching tracer on
RISC-V.

Mark rcu_momentary_dyntick_idle() and stop_machine_yield() notrace to
prevent this.

Fixes: 4ecf0a43e729 ("processor: get rid of cpu_relax_yield")
Fixes: 366237e7b083 ("stop_machine: Provide RCU quiescent state in multi_cpu_stop()")
Signed-off-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Atish Patra <atish.patra@wdc.com>
Tested-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201021073839.43935-1-zong.li@sifive.com


# 72a2fbda 10-Sep-2020 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rcu/tree: docs: document bkvcache new members at struct kfree_rcu_cpu

Changeset 53c72b590b3a ("rcu/tree: cache specified number of objects")
added new members for struct kfree_rcu_cpu, but didn't add the
corresponding at the kernel-doc markup, as repoted when doing
"make htmldocs":
./kernel/rcu/tree.c:3113: warning: Function parameter or member 'bkvcache' not described in 'kfree_rcu_cpu'
./kernel/rcu/tree.c:3113: warning: Function parameter or member 'nr_bkv_objs' not described in 'kfree_rcu_cpu'

So, move the description for bkvcache to kernel-doc, and add a
description for nr_bkv_objs.

Fixes: 53c72b590b3a ("rcu/tree: cache specified number of objects")
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>


# 3ad1c8ef 20-Sep-2020 Borislav Petkov <bp@suse.de>

rcu/tree: Export rcu_idle_{enter,exit} to modules

Fix this link error:

ERROR: modpost: "rcu_idle_enter" [drivers/acpi/processor.ko] undefined!
ERROR: modpost: "rcu_idle_exit" [drivers/acpi/processor.ko] undefined!

when CONFIG_ACPI_PROCESSOR is built as module. PeterZ says that in light
of ARM needing those soon too, they should simply be exported.

Fixes: 1fecfdbb7acc ("ACPI: processor: Take over RCU-idle for C3-BM idle")
Reported-by: Sven Joachim <svenjoac@gmx.de>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Paul E. McKenney <paulmckrcu@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 70060b87 14-Aug-2020 Zqiang <qiang.zhang@windriver.com>

rcu: Shrink each possible cpu krcp

CPUs can go offline shortly after kfree_call_rcu() has been invoked,
which can leave memory stranded until those CPUs come back online.
This commit therefore drains the kcrp of each CPU, not just the
ones that happen to be online.

Acked-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# cfeac397 20-Aug-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove unused "cpu" parameter from rcu_report_qs_rdp()

The "cpu" parameter to rcu_report_qs_rdp() is not used, with rdp->cpu
being used instead. Furtheremore, every call to rcu_report_qs_rdp()
invokes it on rdp->cpu. This commit therefore removes this unused "cpu"
parameter and converts a check of rdp->cpu against smp_processor_id()
to a WARN_ON_ONCE().

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# aa40c138 10-Aug-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Report QS for outermost PREEMPT=n rcu_read_unlock() for strict GPs

The CONFIG_PREEMPT=n instance of rcu_read_unlock is even more
aggressively than that of CONFIG_PREEMPT=y in deferring reporting
quiescent states to the RCU core. This is just what is wanted in normal
use because it reduces overhead, but the resulting delay is not what
is wanted for kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y.
This commit therefore adds an rcu_read_unlock_strict() function that
checks for exceptional conditions, and reports the newly started
quiescent state if it is safe to do so, also doing a spin-delay if
requested via rcutree.rcu_unlock_delay. This commit also adds a call
to rcu_read_unlock_strict() from the CONFIG_PREEMPT=n instance of
__rcu_read_unlock().

[ paulmck: Fixed bug located by kernel test robot <lkp@intel.com> ]
Reported-by Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a657f261 08-Aug-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Execute RCU reader shortly after rcu_core for strict GPs

A kernel built with CONFIG_RCU_STRICT_GRACE_PERIOD=y needs a quiescent
state to appear very shortly after a CPU has noticed a new grace period.
Placing an RCU reader immediately after this point is ineffective because
this normally happens in softirq context, which acts as a big RCU reader.
This commit therefore introduces a new per-CPU work_struct, which is
used at the end of rcu_core() processing to schedule an RCU read-side
critical section from within a clean environment.

Reported-by Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 4e025f52 06-Aug-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: IPI all CPUs at GP end for strict GPs

Currently, each CPU discovers the end of a given grace period on its
own time, which is again good for efficiency but bad for fast grace
periods, given that it is things like kfree() within the RCU callbacks
that will cause trouble for pointers leaked from RCU read-side critical
sections. This commit therefore uses on_each_cpu() to IPI each CPU
after grace-period cleanup in order to inform each CPU of the end of
the old grace period in a timely manner, but only in kernels build with
CONFIG_RCU_STRICT_GRACE_PERIOD=y.

Reported-by Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 933ada2c 06-Aug-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: IPI all CPUs at GP start for strict GPs

Currently, each CPU discovers the beginning of a given grace period
on its own time, which is again good for efficiency but bad for fast
grace periods. This commit therefore uses on_each_cpu() to IPI each
CPU after grace-period initialization in order to inform each CPU of
the new grace period in a timely manner, but only in kernels build with
CONFIG_RCU_STRICT_GRACE_PERIOD=y.

Reported-by Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 1a2f5d57 06-Aug-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Attempt QS when CPU discovers GP for strict GPs

A given CPU normally notes a new grace period during one RCU_SOFTIRQ,
but avoids reporting the corresponding quiescent state until some later
RCU_SOFTIRQ. This leisurly approach improves efficiency by increasing
the number of update requests served by each grace period, but is not
what is needed for kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y.

This commit therefore adds a new rcu_strict_gp_check_qs() function
which, in CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels, simply enters and
immediately exist an RCU read-side critical section. If the CPU is
in a quiescent state, the rcu_read_unlock() will attempt to report an
immediate quiescent state. This rcu_strict_gp_check_qs() function is
invoked from note_gp_changes(), so that a CPU just noticing a new grace
period might immediately report a quiescent state for that grace period.

Reported-by Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 29fc5f93 06-Aug-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Force DEFAULT_RCU_BLIMIT to 1000 for strict RCU GPs

The value of DEFAULT_RCU_BLIMIT is normally set to 10, the idea being to
avoid needless response-time degradation due to RCU callback invocation.
However, when CONFIG_RCU_STRICT_GRACE_PERIOD=y it is better to avoid
throttling callback execution in order to better detect pointer
leaks from RCU read-side critical sections. This commit therefore
sets the value of DEFAULT_RCU_BLIMIT to 1000 in kernels built with
CONFIG_RCU_STRICT_GRACE_PERIOD=y.

Reported-by Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# aecd34b9 05-Aug-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Restrict default jiffies_till_first_fqs for strict RCU GPs

If there are idle CPUs, RCU's grace-period kthread will wait several
jiffies before even thinking about polling them. This promotes
efficiency, which is normally a good thing, but when the kernel
has been built with CONFIG_RCU_STRICT_GRACE_PERIOD=y, we care more
about short grace periods. This commit therefore restricts the
default jiffies_till_first_fqs value to zero in kernels built with
CONFIG_RCU_STRICT_GRACE_PERIOD=y, which causes RCU's grace-period kthread
to poll for idle CPUs immediately after starting a grace period.

Reported-by Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 7f2a53c2 17-Aug-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove unused __rcu_is_watching() function

The x86/entry work removed all uses of __rcu_is_watching(), therefore
this commit removes it entirely.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 666ca290 07-Aug-2020 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Make FQS more aggressive in complaining about offline CPUs

The RCU grace-period kthread's force-quiescent state (FQS) loop should
never see an offline CPU that has not yet reported a quiescent state.
After all, the offline CPU should have reported a quiescent state
during the CPU-offline process, or, failing that, by rcu_gp_init()
if it ran concurrently with either the CPU going offline or the last
task on a leaf rcu_node structure exiting its RCU read-side critical
section while all CPUs corresponding to that structure are offline.
The FQS loop should therefore complain if it does see an offline CPU
that has not yet reported a quiescent state.

And it does, but only once the grace period has been in force for a
full second. This commit therefore makes this warning more aggressive,
so that it will trigger as soon as the condition makes its appearance.

Light testing with TREE03 and hotplug shows no warnings. This commit
also converts the warning to WARN_ON_ONCE() in order to stave off possible
log spam.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# f37599e6 07-Aug-2020 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Clarify comments about FQS loop reporting quiescent states

Since at least v4.19, the FQS loop no longer reports quiescent states
for offline CPUs except in emergency situations.

This commit therefore fixes the comment in rcu_gp_init() to match the
current code.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# c0f97f20 24-Jul-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Move rcu_cpu_started per-CPU variable to rcu_data

When the rcu_cpu_started per-CPU variable was added by commit
f64c6013a202 ("rcu/x86: Provide early rcu_cpu_starting() callback"),
there were multiple sets of per-CPU rcu_data structures. Therefore, the
rcu_cpu_started flag was added as a separate per-CPU variable. But now
there is only one set of per-CPU rcu_data structures, so this commit
moves rcu_cpu_started to a new ->cpu_started field in that structure.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a2b354b9 23-Jun-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_resched_ns

Given that sysfs can change the value of rcu_resched_ns at any time,
this commit adds a READ_ONCE() to the sole access to that variable.
While in the area, this commit also adds bounds checking, clamping the
value to at least a millisecond, but no longer than a second.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# b5374b2d 23-Jun-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_divisor

Given that sysfs can change the value of rcu_divisor at any time, this
commit adds a READ_ONCE to the sole access to that variable. While in
the area, this commit also adds bounds checking, clamping the value to
a shift that makes sense for a signed long.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 9b1ce0ac 22-Jun-2020 Neeraj Upadhyay <neeraju@codeaurora.org>

rcu/tree: Remove CONFIG_PREMPT_RCU check in force_qs_rnp()

Originally, the call to rcu_preempt_blocked_readers_cgp() from
force_qs_rnp() had to be conditioned on CONFIG_PREEMPT_RCU=y, as in
commit a77da14ce9af ("rcu: Yet another fix for preemption and CPU
hotplug"). However, there is now a CONFIG_PREEMPT_RCU=n definition of
rcu_preempt_blocked_readers_cgp() that unconditionally returns zero, so
invoking it is now safe. In addition, the CONFIG_PREEMPT_RCU=n definition
of rcu_initiate_boost() simply releases the rcu_node structure's ->lock,
which is what happens when the "if" condition evaluates to false.

This commit therefore drops the IS_ENABLED(CONFIG_PREEMPT_RCU) check,
so that rcu_initiate_boost() is called only in CONFIG_PREEMPT_RCU=y
kernels when there are readers blocking the current grace period.
This does not change the behavior, but reduces code-reader confusion by
eliminating non-CONFIG_PREEMPT_RCU=y calls to rcu_initiate_boost().

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 9c392453 21-Jun-2020 Neeraj Upadhyay <neeraju@codeaurora.org>

rcu/tree: Force quiescent state on callback overload

On callback overload, it is necessary to quickly detect idle CPUs,
and rcu_gp_fqs_check_wake() checks for this condition. Unfortunately,
the code following the call to this function does not repeat this check,
which means that in reality no actual quiescent-state forcing, instead
only a couple of quick and pointless wakeups at the beginning of the
grace period.

This commit therefore adds a check for the RCU_GP_FLAG_OVLD flag in
the post-wakeup "if" statement in rcu_gp_fqs_loop().

Fixes: 1fca4d12f4637 ("rcu: Expedite first two FQS scans under callback-overload conditions")
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a7886e89 18-Jun-2020 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/trace: Use gp_seq_req in acceleration's rcu_grace_period tracepoint

During acceleration of CB, the rsp's gp_seq is rcu_seq_snap'd. This is
the value used for acceleration - it is the value of gp_seq at which it
is safe the execute all callbacks in the callback list.

The rdp's gp_seq is not very useful for this scenario. Make
rcu_grace_period report the gp_seq_req instead as it allows one to
reason about how the acceleration works.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# ebc3505d 17-Jun-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove KCSAN stubs

KCSAN is now in mainline, so this commit removes the stubs for the
data_race(), ASSERT_EXCLUSIVE_WRITER(), and ASSERT_EXCLUSIVE_ACCESS()
macros.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 26e760c9 07-Aug-2020 Walter Wu <walter-zh.wu@mediatek.com>

rcu: kasan: record and print call_rcu() call stack

Patch series "kasan: memorize and print call_rcu stack", v8.

This patchset improves KASAN reports by making them to have call_rcu()
call stack information. It is useful for programmers to solve
use-after-free or double-free memory issue.

The KASAN report was as follows(cleaned up slightly):

BUG: KASAN: use-after-free in kasan_rcu_reclaim+0x58/0x60

Freed by task 0:
kasan_save_stack+0x24/0x50
kasan_set_track+0x24/0x38
kasan_set_free_info+0x18/0x20
__kasan_slab_free+0x10c/0x170
kasan_slab_free+0x10/0x18
kfree+0x98/0x270
kasan_rcu_reclaim+0x1c/0x60

Last call_rcu():
kasan_save_stack+0x24/0x50
kasan_record_aux_stack+0xbc/0xd0
call_rcu+0x8c/0x580
kasan_rcu_uaf+0xf4/0xf8

Generic KASAN will record the last two call_rcu() call stacks and print up
to 2 call_rcu() call stacks in KASAN report. it is only suitable for
generic KASAN.

This feature considers the size of struct kasan_alloc_meta and
kasan_free_meta, we try to optimize the structure layout and size, lets it
get better memory consumption.

[1]https://bugzilla.kernel.org/show_bug.cgi?id=198437
[2]https://groups.google.com/forum/#!searchin/kasan-dev/better$20stack$20traces$20for$20rcu%7Csort:date/kasan-dev/KQsjT_88hDE/7rNUZprRBgAJ

This patch (of 4):

This feature will record the last two call_rcu() call stacks and prints up
to 2 call_rcu() call stacks in KASAN report.

When call_rcu() is called, we store the call_rcu() call stack into slub
alloc meta-data, so that the KASAN report can print rcu stack.

[1]https://bugzilla.kernel.org/show_bug.cgi?id=198437
[2]https://groups.google.com/forum/#!searchin/kasan-dev/better$20stack$20traces$20for$20rcu%7Csort:date/kasan-dev/KQsjT_88hDE/7rNUZprRBgAJ

[walter-zh.wu@mediatek.com: build fix]
Link: http://lkml.kernel.org/r/20200710162401.23816-1-walter-zh.wu@mediatek.com

Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Link: http://lkml.kernel.org/r/20200710162123.23713-1-walter-zh.wu@mediatek.com
Link: http://lkml.kernel.org/r/20200601050847.1096-1-walter-zh.wu@mediatek.com
Link: http://lkml.kernel.org/r/20200601050927.1153-1-walter-zh.wu@mediatek.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3042f83f 25-May-2020 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu: Support reclaim for head-less object

Update the kvfree_call_rcu() function with head-less support.
This allows RCU to reclaim objects without an embedded rcu_head.

tree-RCU:
We introduce two chains of arrays to store SLAB-backed and vmalloc
pointers, each. Storage in either of these arrays does not require
embedding an rcu_head within the object.

Maintaining the arrays may become impossible due to high memory
pressure. For such cases there is an emergency path. Objects with
rcu_head inside are just queued on a backup rcu_head list. Later on
that list is drained. As for the head-less variant, as the current
context can sleep, the following emergency measures are applied:
a) Synchronously wait until a grace period has elapsed.
b) Call kvfree().

tiny-RCU:
For double argument calls, there are no new changes in behavior. For
single argument call, kvfree() is directly inlined on the current
stack after a synchronize_rcu() call. Note that for tiny-RCU, any
call to synchronize_rcu() is actually a quiescent state, therefore
it does nothing.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# c408b215 25-May-2020 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu: Rename *_kfree_callback/*_kfree_rcu_offset/kfree_call_*

The following changes are introduced:

1. Rename rcu_invoke_kfree_callback() to rcu_invoke_kvfree_callback(),
as well as the associated trace events, so the rcu_kfree_callback(),
becomes rcu_kvfree_callback(). The reason is to be aligned with kvfree()
notation.

2. Rename __is_kfree_rcu_offset to __is_kvfree_rcu_offset. All RCU
paths use kvfree() now instead of kfree(), thus rename it.

3. Rename kfree_call_rcu() to the kvfree_call_rcu(). The reason is,
it is capable of freeing vmalloc() memory now. Do the same with
__kfree_rcu() macro, it becomes __kvfree_rcu(), the goal is the
same.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 5f3c8d62 25-May-2020 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/tree: Maintain separate array for vmalloc ptrs

To do so, we use an array of kvfree_rcu_bulk_data structures.
It consists of two elements:
- index number 0 corresponds to slab pointers.
- index number 1 corresponds to vmalloc pointers.

Keeping vmalloc pointers separated from slab pointers makes
it possible to invoke the right freeing API for the right
kind of pointer.

It also prepares us for future headless support for vmalloc
and SLAB objects. Such objects cannot be queued on a linked
list and are instead directly into an array.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 53c72b59 25-May-2020 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/tree: cache specified number of objects

In order to reduce the dynamic need for pages in kfree_rcu(),
pre-allocate a configurable number of pages per CPU and link
them in a list. When kfree_rcu() reclaims objects, the object's
container page is cached into a list instead of being released
to the low-level page allocator.

Such an approach provides O(1) access to free pages while also
reducing the number of requests to the page allocator. It also
makes the kfree_rcu() code to have free pages available during
a low memory condition.

A read-only sysfs parameter (rcu_min_cached_objs) reflects the
minimum number of allowed cached pages per CPU.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 69f08d39 25-May-2020 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

rcu/tree: Use static initializer for krc.lock

The per-CPU variable is initialized at runtime in
kfree_rcu_batch_init(). This function is invoked before
'rcu_scheduler_active' is set to 'RCU_SCHEDULER_RUNNING'.
After the initialisation, '->initialized' is to true.

The raw_spin_lock is only acquired if '->initialized' is
set to true. The worqueue item is only used if 'rcu_scheduler_active'
set to RCU_SCHEDULER_RUNNING which happens after initialisation.

Use a static initializer for krc.lock and remove the runtime
initialisation of the lock. Since the lock can now be always
acquired, remove the '->initialized' check.

Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 952371d6 25-May-2020 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/tree: Move kfree_rcu_cpu locking/unlocking to separate functions

Introduce helpers to lock and unlock per-cpu "kfree_rcu_cpu"
structures. That will make kfree_call_rcu() more readable
and prevent programming errors.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 3af84862 25-May-2020 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/tree: Simplify KFREE_BULK_MAX_ENTR macro

We can simplify KFREE_BULK_MAX_ENTR macro and get rid of
magic numbers which were used to make the structure to be
exactly one page.

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 446044eb 25-May-2020 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/tree: Make debug_objects logic independent of rcu_head

kfree_rcu()'s debug_objects logic uses the address of the object's
embedded rcu_head to queue/unqueue. Instead of this, make use of the
object's address itself as preparation for future headless kfree_rcu()
support.

Reviewed-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 594aa597 25-May-2020 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu/tree: Repeat the monitor if any free channel is busy

It is possible that one of the channels cannot be detached
because its free channel is busy and previously queued data
has not been processed yet. On the other hand, another
channel can be successfully detached causing the monitor
work to stop.

Prevent that by rescheduling the monitor work if there are
any channels in the pending state after a detach attempt.

Fixes: 34c881745549e ("rcu: Support kfree_bulk() interface in kfree_rcu()")
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 4d291941 25-May-2020 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/tree: Skip entry into the page allocator for PREEMPT_RT

To keep the kfree_rcu() code working in purely atomic sections on RT,
such as non-threaded IRQ handlers and raw spinlock sections, avoid
calling into the page allocator which uses sleeping locks on RT.

In fact, even if the caller is preemptible, the kfree_rcu() code is
not, as the krcp->lock is a raw spinlock.

Calling into the page allocator is optional and avoiding it should be
Ok, especially with the page pre-allocation support in future patches.
Such pre-allocation would further avoid the a need for a dynamically
allocated page in the first place.

Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Uladzislau Rezki <urezki@gmail.com>
Co-developed-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 8ac88f71 25-May-2020 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/tree: Keep kfree_rcu() awake during lock contention

On PREEMPT_RT kernels, the krcp spinlock gets converted to an rt-mutex
and causes kfree_rcu() callers to sleep. This makes it unusable for
callers in purely atomic sections such as non-threaded IRQ handlers and
raw spinlock sections. Fix it by converting the spinlock to a raw
spinlock.

Vetting all code paths, there is no reason to believe that the raw
spinlock will hurt RT latencies as it is not held for a long time.

Cc: bigeasy@linutronix.de
Cc: Uladzislau Rezki <urezki@gmail.com>
Reviewed-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 8e11690d 04-May-2020 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

rcu: Fix a kernel-doc warnings for "count"

There are some kernel-doc warnings:

./kernel/rcu/tree.c:2915: warning: Function parameter or member 'count' not described in 'kfree_rcu_cpu'

This commit therefore moves the comment for "count" to the kernel-doc
markup.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# c3cb47a6 15-Jun-2020 Randy Dunlap <rdunlap@infradead.org>

kernel/rcu/tree.c: Fix kernel-doc warnings

Fix kernel-doc warning:

../kernel/rcu/tree.c:959: warning: Excess function parameter 'irq' description in 'rcu_nmi_enter'

Fixes: cf7614e13c8f ("rcu: Refactor rcu_{nmi,irq}_{enter,exit}()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# c6dfd72b 03-Jun-2020 Peter Enderborg <peter.enderborg@sony.com>

rcu: Stop shrinker loop

The count and scan can be separated in time, and there is a fair chance
that all work is already done when the scan starts, which might in turn
result in a needless retry. This commit therefore avoids this retry by
returning SHRINK_STOP.

Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Peter Enderborg <peter.enderborg@sony.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 04b25a49 19-May-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Mark rcu_nmi_enter() call to rcu_cleanup_after_idle() noinstr

The objtool complains about the call to rcu_cleanup_after_idle() from
rcu_nmi_enter(), so this commit adds instrumentation_begin() before that
call and instrumentation_end() after it.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 77865dea 07-May-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Grace-period-kthread related sleeps to idle priority

This commit converts the long-standing schedule_timeout_interruptible()
and schedule_timeout_uninterruptible() calls used by RCU's grace-period
kthread to schedule_timeout_idle(). This conversion avoids polluting
the load-average with RCU-related sleeping.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# e816d56f 01-May-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Add callbacks-invoked counters

This commit adds a count of the callbacks invoked to the per-CPU rcu_data
structure. This count is printed by the show_rcu_gp_kthreads() that
is invoked by rcutorture and the RCU CPU stall-warning code. It is also
intended for use by drgn.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# abfce041 19-Apr-2020 Wei Yang <richard.weiyang@gmail.com>

rcu: Simplify the calculation of rcu_state.ncpus

There is only 1 bit set in mask, which means that the only difference
between oldmask and the new one will be at the position where the bit is
set in mask. This commit therefore updates rcu_state.ncpus by checking
whether the bit in mask is already set in rnp->expmaskinitnext.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# b58e733f 15-Jun-2020 Peter Zijlstra <peterz@infradead.org>

rcu: Fixup noinstr warnings

A KCSAN build revealed we have explicit annoations through atomic_*()
usage, switch to arch_atomic_*() for the respective functions.

vmlinux.o: warning: objtool: rcu_nmi_exit()+0x4d: call to __kcsan_check_access() leaves .noinstr.text section
vmlinux.o: warning: objtool: rcu_dynticks_eqs_enter()+0x25: call to __kcsan_check_access() leaves .noinstr.text section
vmlinux.o: warning: objtool: rcu_nmi_enter()+0x4f: call to __kcsan_check_access() leaves .noinstr.text section
vmlinux.o: warning: objtool: rcu_dynticks_eqs_exit()+0x2a: call to __kcsan_check_access() leaves .noinstr.text section
vmlinux.o: warning: objtool: __rcu_is_watching()+0x25: call to __kcsan_check_access() leaves .noinstr.text section

Additionally, without the NOP in instrumentation_begin(), objtool would
not detect the lack of the 'else instrumentation_begin();' branch in
rcu_nmi_enter().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 806f04e9 27-May-2020 Peter Zijlstra <peterz@infradead.org>

rcu: Allow for smp_call_function() running callbacks from idle

Current RCU hard relies on smp_call_function() callbacks running from
interrupt context. A pending optimization is going to break that, it
will allow idle CPUs to run the callbacks from the idle loop. This
avoids raising the IPI on the requesting CPU and avoids handling an
exception on the receiving CPU.

Change rcu_is_cpu_rrupt_from_idle() to also accept task context,
provided it is the idle task.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20200527171236.GC706495@hirez.programming.kicks-ass.net


# 07325d4a 21-May-2020 Thomas Gleixner <tglx@linutronix.de>

rcu: Provide rcu_irq_exit_check_preempt()

Provide a debug check which can be invoked from exception return to kernel
mode before an attempt is made to schedule. Warn if RCU is not ready for
this.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20200521202117.089709607@linutronix.de


# aaf2bc50 21-May-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()

There will likely be exception handlers that can sleep, which rules
out the usual approach of invoking rcu_nmi_enter() on entry and also
rcu_nmi_exit() on all exit paths. However, the alternative approach of
just not calling anything can prevent RCU from coaxing quiescent states
from nohz_full CPUs that are looping in the kernel: RCU must instead
IPI them explicitly. It would be better to enable the scheduler tick
on such CPUs to interact with RCU in a lighter-weight manner, and this
enabling is one of the things that rcu_nmi_enter() currently does.

What is needed is something that helps RCU coax quiescent states while
not preventing subsequent sleeps. This commit therefore splits out the
nohz_full scheduler-tick enabling from the rest of the rcu_nmi_enter()
logic into a new function named rcu_irq_enter_check_tick().

[ tglx: Renamed the function and made it a nop when context tracking is off ]
[ mingo: Fixed a CONFIG_NO_HZ_FULL assumption, harmonized and fixed all the
comment blocks and cleaned up rcu_nmi_enter()/exit() definitions. ]

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200521202116.996113173@linutronix.de


# b1fcf9b8 12-May-2020 Thomas Gleixner <tglx@linutronix.de>

rcu: Provide __rcu_is_watching()

Same as rcu_is_watching() but without the preempt_disable/enable() pair
inside the function. It is merked noinstr so it ends up in the
non-instrumentable text section.

This is useful for non-preemptible code especially in the low level entry
section. Using rcu_is_watching() there results in a call to the
preempt_schedule_notrace() thunk which triggers noinstr section warnings in
objtool.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200512213810.518709291@linutronix.de


# 8ae0ae67 03-May-2020 Thomas Gleixner <tglx@linutronix.de>

rcu: Provide rcu_irq_exit_preempt()

Interrupts and exceptions invoke rcu_irq_enter() on entry and need to
invoke rcu_irq_exit() before they either return to the interrupted code or
invoke the scheduler due to preemption.

The general assumption is that RCU idle code has to have preemption
disabled so that a return from interrupt cannot schedule. So the return
from interrupt code invokes rcu_irq_exit() and preempt_schedule_irq().

If there is any imbalance in the rcu_irq/nmi* invocations or RCU idle code
had preemption enabled then this goes unnoticed until the CPU goes idle or
some other RCU check is executed.

Provide rcu_irq_exit_preempt() which can be invoked from the
interrupt/exception return code in case that preemption is enabled. It
invokes rcu_irq_exit() and contains a few sanity checks in case that
CONFIG_PROVE_RCU is enabled to catch such issues directly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134904.364456424@linutronix.de


# 9ea366f6 13-Feb-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Make RCU IRQ enter/exit functions rely on in_nmi()

The rcu_nmi_enter_common() and rcu_nmi_exit_common() functions take an
"irq" parameter that indicates whether these functions have been invoked from
an irq handler (irq==true) or an NMI handler (irq==false).

However, recent changes have applied notrace to a few critical functions
such that rcu_nmi_enter_common() and rcu_nmi_exit_common() many now rely on
in_nmi(). Note that in_nmi() works no differently than before, but rather
that tracing is now prohibited in code regions where in_nmi() would
incorrectly report NMI state.

Therefore remove the "irq" parameter and inline rcu_nmi_enter_common() and
rcu_nmi_exit_common() into rcu_nmi_enter() and rcu_nmi_exit(),
respectively.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.617130349@linutronix.de


# ff5c4f5c 13-Mar-2020 Thomas Gleixner <tglx@linutronix.de>

rcu/tree: Mark the idle relevant functions noinstr

These functions are invoked from context tracking and other places in the
low level entry code. Move them into the .noinstr.text section to exclude
them from instrumentation.

Mark the places which are safe to invoke traceable functions with
instrumentation_begin/end() so objtool won't complain.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lkml.kernel.org/r/20200505134100.575356107@linutronix.de


# 55b2dcf5 01-Apr-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Allow rcutorture to starve grace-period kthread

This commit provides an rcutorture.stall_gp_kthread module parameter
to allow rcutorture to starve the grace-period kthread. This allows
testing the code that detects such starvation.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 7d0c9c50 19-Mar-2020 Paul E. McKenney <paulmck@kernel.org>

rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built

Systems running CPU-bound real-time task do not want IPIs sent to CPUs
executing nohz_full userspace tasks. Battery-powered systems don't
want IPIs sent to idle CPUs in low-power mode. Unfortunately, RCU tasks
trace can and will send such IPIs in some cases.

Both of these situations occur only when the target CPU is in RCU
dyntick-idle mode, in other words, when RCU is not watching the
target CPU. This suggests that CPUs in dyntick-idle mode should use
memory barriers in outermost invocations of rcu_read_lock_trace()
and rcu_read_unlock_trace(), which would allow the RCU tasks trace
grace period to directly read out the target CPU's read-side state.
One challenge is that RCU tasks trace is not targeting a specific
CPU, but rather a task. And that task could switch from one CPU to
another at any time.

This commit therefore uses try_invoke_on_locked_down_task()
and checks for task_curr() in trc_inspect_reader_notrunning().
When this condition holds, the target task is running and cannot move.
If CONFIG_TASKS_TRACE_RCU_READ_MB=y, the new rcu_dynticks_zero_in_eqs()
function can be used to check if the specified integer (in this case,
t->trc_reader_nesting) is zero while the target CPU remains in that same
dyntick-idle sojourn. If so, the target task is in a quiescent state.
If not, trc_read_check_handler() must indicate failure so that the
grace-period kthread can take appropriate action or retry after an
appropriate delay, as the case may be.

With this change, given CONFIG_TASKS_TRACE_RCU_READ_MB=y, if a given
CPU remains idle or a given task continues executing in nohz_full mode,
the RCU tasks trace grace-period kthread will detect this without the
need to send an IPI.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# ac3caf82 12-Mar-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Add comments marking transitions between RCU watching and not

It is not as clear as it might be just where in RCU's idle entry/exit
code RCU stops and starts watching the current CPU. This commit therefore
adds comments calling out the transitions.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a6a82ce1 15-Mar-2020 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/tree: Count number of batched kfree_rcu() locklessly

We can relax the correctness of counting of number of queued objects in
favor of not hurting performance, by locklessly sampling per-cpu
counters. This should be Ok since under high memory pressure, it should not
matter if we are off by a few objects while counting. The shrinker will
still do the reclaim.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Remove unused "flags" variable. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 9154244c 15-Mar-2020 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu/tree: Add a shrinker to prevent OOM due to kfree_rcu() batching

To reduce grace periods and improve kfree() performance, we have done
batching recently dramatically bringing down the number of grace periods
while giving us the ability to use kfree_bulk() for efficient kfree'ing.

However, this has increased the likelihood of OOM condition under heavy
kfree_rcu() flood on small memory systems. This patch introduces a
shrinker which starts grace periods right away if the system is under
memory pressure due to existence of objects that have still not started
a grace period.

With this patch, I do not observe an OOM anymore on a system with 512MB
RAM and 8 CPUs, with the following rcuperf options:

rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000
rcuperf.kfree_rcu_test=1 rcuperf.kfree_mult=2

Otherwise it easily OOMs with the above parameters.

NOTE:
1. On systems with no memory pressure, the patch has no effect as intended.
2. In the future, we can use this same mechanism to prevent grace periods
from happening even more, by relying on shrinkers carefully.

Cc: urezki@gmail.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 29ffebc5 10-Apr-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Convert ULONG_CMP_GE() to time_after() for jiffy comparison

This commit converts the ULONG_CMP_GE() in rcu_gp_fqs_loop() to
time_after() to reflect the fact that it is comparing a timestamp to
the jiffies counter.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# da44cd6c 29-Mar-2020 Jules Irenge <jbi.octave@gmail.com>

rcu: Replace 1 by true

Coccinelle reports a warning at use_softirq declaration

WARNING: Assignment of 0/1 to bool variable

The root cause is
use_softirq a variable of bool type is initialised with the integer 1
Replacing 1 with value true solve the issue.

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 62ae1951 21-Mar-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Mark rcu_state.gp_seq to detect more concurrent writes

The rcu_state structure's gp_seq field is only to be modified by the RCU
grace-period kthread, which is single-threaded. This commit therefore
enlists KCSAN's help in enforcing this restriction. This commit applies
KCSAN-specific primitives, so cannot go upstream until KCSAN does.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 1fca4d12 22-Feb-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Expedite first two FQS scans under callback-overload conditions

Even if some CPUs have excessive numbers of callbacks, RCU's grace-period
kthread will still wait normally between successive force-quiescent-state
scans. The first two are the most important, as they are the ones that
enlist aid from the scheduler when overloaded. This commit therefore
omits the wait before the first and the second force-quiescent-state
scan under callback-overload conditions.

This approach was inspired by a discussion with Jeff Roberson.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 2f084695 10-Feb-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Mark rcu_state.ncpus to detect concurrent writes

The rcu_state structure's ncpus field is only to be modified by the
CPU-hotplug CPU-online code path, which is single-threaded. This
commit therefore enlists KCSAN's help in enforcing this restriction.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 35315936 13-Apr-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Add KCSAN stubs

This commit adds stubs for KCSAN's data_race(), ASSERT_EXCLUSIVE_WRITER(),
and ASSERT_EXCLUSIVE_ACCESS() macros to allow code using these macros to
move ahead.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# bf37da98 12-Mar-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Don't acquire lock in NMI handler in rcu_nmi_enter_common()

The rcu_nmi_enter_common() function can be invoked both in interrupt
and NMI handlers. If it is invoked from process context (as opposed
to userspace or idle context) on a nohz_full CPU, it might acquire the
CPU's leaf rcu_node structure's ->lock. Because this lock is held only
with interrupts disabled, this is safe from an interrupt handler, but
doing so from an NMI handler can result in self-deadlock.

This commit therefore adds "irq" to the "if" condition so as to only
acquire the ->lock from irq handlers or process context, never from
an NMI handler.

Fixes: 5b14557b073c ("rcu: Avoid tick_dep_set_cpu() misordering")
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: <stable@vger.kernel.org> # 5.5.x


# 127e2981 11-Feb-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_barrier() account for offline no-CBs CPUs

Currently, rcu_barrier() ignores offline CPUs, However, it is possible
for an offline no-CBs CPU to have callbacks queued, and rcu_barrier()
must wait for those callbacks. This commit therefore makes rcu_barrier()
directly invoke the rcu_barrier_func() with interrupts disabled for such
CPUs. This requires passing the CPU number into this function so that
it can entrain the rcu_barrier() callback onto the correct CPU's callback
list, given that the code must instead execute on the current CPU.

While in the area, this commit fixes a bug where the first CPU's callback
might have been invoked before rcu_segcblist_entrain() returned, which
would also result in an early wakeup.

Fixes: 5d6742b37727 ("rcu/nocb: Use rcu_segcblist for no-CBs CPUs")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Apply optimization feedback from Boqun Feng. ]
Cc: <stable@vger.kernel.org> # 5.5.x


# 0f11ad32 10-Feb-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Mark rcu_state.gp_seq to detect concurrent writes

The rcu_state structure's gp_seq field is only to be modified by the RCU
grace-period kthread, which is single-threaded. This commit therefore
enlists KCSAN's help in enforcing this restriction.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 49915ac3 20-Mar-2020 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

lockdep: Annotate irq_work

Mark irq_work items with IRQ_WORK_HARD_IRQ which should be invoked in
hardirq context even on PREEMPT_RT. IRQ_WORK without this flag will be
invoked in softirq context on PREEMPT_RT.

Set ->irq_config to 1 for the IRQ_WORK items which are invoked in softirq
context so lockdep knows that these can safely acquire a spinlock_t.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200321113242.643576700@linutronix.de


# b692dc4a 11-Feb-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Update __call_rcu() comments

The __call_rcu() function's header comment refers to a cpu argument
that no longer exists, and the comment of the return path from
rcu_nocb_try_bypass() ignores the non-no-CBs CPU case. This commit
therefore update both comments.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# b2b00ddf 30-Oct-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: React to callback overload by aggressively seeking quiescent states

In default configutions, RCU currently waits at least 100 milliseconds
before asking cond_resched() and/or resched_rcu() for help seeking
quiescent states to end a grace period. But 100 milliseconds can be
one good long time during an RCU callback flood, for example, as can
happen when user processes repeatedly open and close files in a tight
loop. These 100-millisecond gaps in successive grace periods during a
callback flood can result in excessive numbers of callbacks piling up,
unnecessarily increasing memory footprint.

This commit therefore asks cond_resched() and/or resched_rcu() for help
as early as the first FQS scan when at least one of the CPUs has more
than 20,000 callbacks queued, a number that can be changed using the new
rcutree.qovld kernel boot parameter. An auxiliary qovld_calc variable
is used to avoid acquisition of locks that have not yet been initialized.
Early tests indicate that this reduces the RCU-callback memory footprint
during rcutorture floods by from 50% to 4x, depending on configuration.

Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reported-by: Tejun Heo <tj@kernel.org>
[ paulmck: Fix bug located by Qian Cai. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Dexuan Cui <decui@microsoft.com>
Tested-by: Qian Cai <cai@lca.pw>


# b5ea0370 09-Dec-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Clear ->core_needs_qs at GP end or self-reported QS

The rcu_data structure's ->core_needs_qs field does not necessarily get
cleared in a timely fashion after the corresponding CPUs' quiescent state
has been reported. From a functional viewpoint, no harm done, but this
can result in excessive invocation of RCU core processing, as witnessed
by the kernel test robot, which saw greatly increased softirq overhead.

This commit therefore restores the rcu_report_qs_rdp() function's
clearing of this field, but only when running on the corresponding CPU.
Cases where some other CPU reports the quiescent state (for example, on
behalf of an idle CPU) are handled by setting this field appropriately
within the __note_gp_changes() function's end-of-grace-period checks.
This handling is carried out regardless of whether the end of a grace
period actually happened, thus handling the case where a CPU goes non-idle
after a quiescent state is reported on its behalf, but before the grace
period ends. This fix also avoids cross-CPU updates to ->core_needs_qs,

While in the area, this commit changes the __note_gp_changes() need_gp
variable's name to need_qs because it is a quiescent state that is needed
from the CPU in question.

Fixes: ed93dfc6bc00 ("rcu: Confine ->core_needs_qs accesses to the corresponding CPU")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 61370792 20-Jan-2020 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu: Add a trace event for kfree_rcu() use of kfree_bulk()

The event is given three parameters, first one is the name
of RCU flavour, second one is the number of elements in array
for free and last one is an address of the array holding
pointers to be freed by the kfree_bulk() function.

To enable the trace event your kernel has to be build with
CONFIG_RCU_TRACE=y, after that it is possible to track the
events using ftrace subsystem.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 34c88174 20-Jan-2020 Uladzislau Rezki (Sony) <urezki@gmail.com>

rcu: Support kfree_bulk() interface in kfree_rcu()

The kfree_rcu() logic can be improved further by using kfree_bulk()
interface along with "basic batching support" introduced earlier.

The are at least two advantages of using "bulk" interface:
- in case of large number of kfree_rcu() requests kfree_bulk()
reduces the per-object overhead caused by calling kfree()
per-object.

- reduces the number of cache-misses due to "pointer chasing"
between objects which can be far spread between each other.

This approach defines a new kfree_rcu_bulk_data structure that
stores pointers in an array with a specific size. Number of entries
in that array depends on PAGE_SIZE making kfree_rcu_bulk_data
structure to be exactly one page.

Since it deals with "block-chain" technique there is an extra
need in dynamic allocation when a new block is required. Memory
is allocated with GFP_NOWAIT | __GFP_NOWARN flags, i.e. that
allows to skip direct reclaim under low memory condition to
prevent stalling and fails silently under high memory pressure.

The "emergency path" gets maintained when a system is run out of
memory. In that case objects are linked into regular list.

The "rcuperf" was run to analyze this change in terms of memory
consumption and kfree_bulk() throughput.

1) Testing on the Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz, 12xCPUs
with following parameters:

kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1 kfree_vary_obj_size=1
dev.2020.01.10a branch

Default / CONFIG_SLAB
53607352517 ns, loops: 200000, batches: 1885, memory footprint: 1248MB
53529637912 ns, loops: 200000, batches: 1921, memory footprint: 1193MB
53570175705 ns, loops: 200000, batches: 1929, memory footprint: 1250MB

Patch / CONFIG_SLAB
23981587315 ns, loops: 200000, batches: 810, memory footprint: 1219MB
23879375281 ns, loops: 200000, batches: 822, memory footprint: 1190MB
24086841707 ns, loops: 200000, batches: 794, memory footprint: 1380MB

Default / CONFIG_SLUB
51291025022 ns, loops: 200000, batches: 1713, memory footprint: 741MB
51278911477 ns, loops: 200000, batches: 1671, memory footprint: 719MB
51256183045 ns, loops: 200000, batches: 1719, memory footprint: 647MB

Patch / CONFIG_SLUB
50709919132 ns, loops: 200000, batches: 1618, memory footprint: 456MB
50736297452 ns, loops: 200000, batches: 1633, memory footprint: 507MB
50660403893 ns, loops: 200000, batches: 1628, memory footprint: 429MB

in case of CONFIG_SLAB there is double increase in performance and
slightly higher memory usage. As for CONFIG_SLUB, the performance
figures are better together with lower memory usage.

2) Testing on the HiKey-960, arm64, 8xCPUs with below parameters:

CONFIG_SLAB=y
kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1

102898760401 ns, loops: 200000, batches: 5822, memory footprint: 158MB
89947009882 ns, loops: 200000, batches: 6715, memory footprint: 115MB

rcuperf shows approximately ~12% better throughput in case of
using "bulk" interface. The "drain logic" or its RCU callback
does the work faster that leads to better throughput.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# faa059c3 03-Feb-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Optimize and protect atomic_cmpxchg() loop

This commit reworks the atomic_cmpxchg() loop in rcu_eqs_special_set()
to do only the initial read from the current CPU's rcu_data structure's
->dynticks field explicitly. On subsequent passes, this value is instead
retained from the failing atomic_cmpxchg() operation.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 5648d659 21-Jan-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Don't flag non-starting GPs before GP kthread is running

Currently rcu_check_gp_start_stall() complains if a grace period takes
too long to start, where "too long" is roughly one RCU CPU stall-warning
interval. This has worked well, but there are some debugging Kconfig
options (such as CONFIG_EFI_PGT_DUMP=y) that can make booting take a
very long time, so much so that the stall-warning interval has expired
before RCU's grace-period kthread has even been spawned.

This commit therefore resets the rcu_state.gp_req_activity and
rcu_state.gp_activity timestamps just before the grace-period kthread
is spawned, and modifies the checks and adds ordering to ensure that
if rcu_check_gp_start_stall() sees that the grace-period kthread
has been spawned, that it will also see the resets applied to the
rcu_state.gp_req_activity and rcu_state.gp_activity timestamps.

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Fix whitespace issues reported by Qian Cai. ]
Tested-by: Qian Cai <cai@lca.pw>
[ paulmck: Simplify grace-period wakeup check per Steve Rostedt feedback. ]


# aa24f937 20-Jan-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix rcu_barrier_callback() race condition

The rcu_barrier_callback() function does an atomic_dec_and_test(), and
if it is the last CPU to check in, does the required wakeup. Either way,
it does an event trace. Unfortunately, this is susceptible to the
following sequence of events:

o CPU 0 invokes rcu_barrier_callback(), but atomic_dec_and_test()
says that it is not last. But at this point, CPU 0 is delayed,
perhaps due to an NMI, SMI, or vCPU preemption.

o CPU 1 invokes rcu_barrier_callback(), and atomic_dec_and_test()
says that it is last. So CPU 1 traces completion and does
the needed wakeup.

o The awakened rcu_barrier() function does cleanup and releases
rcu_state.barrier_mutex.

o Another CPU now acquires rcu_state.barrier_mutex and starts
another round of rcu_barrier() processing, including updating
rcu_state.barrier_sequence.

o CPU 0 gets its act back together and does its tracing. Except
that rcu_state.barrier_sequence has already been updated, so
its tracing is incorrect and probably quite confusing.
(Wait! Why did this CPU check in twice for one rcu_barrier()
invocation???)

This commit therefore causes rcu_barrier_callback() to take a
snapshot of the value of rcu_state.barrier_sequence before invoking
atomic_dec_and_test(), thus guaranteeing that the event-trace output
is sensible, even if the timing of the event-trace output might still
be confusing. (Wait! Why did the old rcu_barrier() complete before
all of its CPUs checked in???) But being that this is RCU, only so much
confusion can reasonably be eliminated.

This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely and due to the mild consequences of the
failure, namely a confusing event trace.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 2a2ae872 08-Jan-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Add *_ONCE() to rcu_data ->rcu_forced_tick

The rcu_data structure's ->rcu_forced_tick field is read locklessly, so
this commit adds WRITE_ONCE() to all updates and READ_ONCE() to all
lockless reads.

This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a5b89501 07-Jan-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Add READ_ONCE() to rcu_data ->gpwrap

The rcu_data structure's ->gpwrap field is read locklessly, and so
this commit adds the required READ_ONCE() to a pair of laods in order
to avoid destructive compiler optimizations.

This data race was reported by KCSAN.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 65bb0dc4 06-Jan-2020 SeongJae Park <sjpark@amazon.de>

rcu: Fix typos in file-header comments

Convert to plural and add a note that this is for Tree RCU.

Signed-off-by: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 8ff37290 04-Jan-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Add *_ONCE() for grace-period progress indicators

The various RCU structures' ->gp_seq, ->gp_seq_needed, ->gp_req_activity,
and ->gp_activity fields are read locklessly, so they must be updated with
WRITE_ONCE() and, when read locklessly, with READ_ONCE(). This commit makes
these changes.

This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 105abf82 03-Jan-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Add WRITE_ONCE() to rcu_node ->qsmaskinitnext

The rcu_state structure's ->qsmaskinitnext field is read locklessly,
so this commit adds the WRITE_ONCE() to an update in order to provide
proper documentation and READ_ONCE()/WRITE_ONCE() pairing.

This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely for systems not doing incessant CPU-hotplug
operations.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 2906d215 03-Jan-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Add WRITE_ONCE() to rcu_state ->gp_req_activity

The rcu_state structure's ->gp_req_activity field is read locklessly,
so this commit adds the WRITE_ONCE() to an update in order to provide
proper documentation and READ_ONCE()/WRITE_ONCE() pairing.

This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 0937d045 03-Jan-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Add READ_ONCE() to rcu_node ->gp_seq

The rcu_node structure's ->gp_seq field is read locklessly, so this
commit adds the READ_ONCE() to several loads in order to avoid
destructive compiler optimizations.

This data race was reported by KCSAN. Not appropriate for backporting
because this affects only tracing and warnings.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 7672d647 03-Jan-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Add WRITE_ONCE() to rcu_node ->qsmask update

The rcu_node structure's ->qsmask field is read locklessly, so this
commit adds the WRITE_ONCE() to an update in order to provide proper
documentation and READ_ONCE()/WRITE_ONCE() pairing.

This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# f6105fc2 27-Nov-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove unused stop-machine #include

Long ago, RCU used the stop-machine mechanism to implement expedited
grace periods, but no longer does so. This commit therefore removes
the no-longer-needed #includes of linux/stop_machine.h.

Link: https://lwn.net/Articles/805317/
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 7441e766 30-Oct-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Switch force_qs_rnp() to for_each_leaf_node_cpu_mask()

Currently, force_qs_rnp() uses a for_each_leaf_node_possible_cpu()
loop containing a check of the current CPU's bit in ->qsmask.
This works, but this commit saves three lines by instead using
for_each_leaf_node_cpu_mask(), which combines the functionality of
for_each_leaf_node_possible_cpu() and leaf_node_cpu_bit(). This commit
also replaces the use of the local variable "bit" with rdp->grpmask.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# e2167b38 15-Oct-2019 Lai Jiangshan <jiangshanlai@gmail.com>

rcu: Move gp_state_names[] and gp_state_getname() to tree_stall.h

Only tree_stall.h needs to get name from GP state, so this commit
moves the gp_state_names[] array and the gp_state_getname()
from kernel/rcu/tree.h and kernel/rcu/tree.c, respectively, to
kernel/rcu/tree_stall.h. While moving gp_state_names[], this commit
uses the GCC syntax to ensure that the right string is associated with
the right CPP macro.

Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 2488a5e6 15-Oct-2019 Lai Jiangshan <jiangshanlai@gmail.com>

rcu: Fix tracepoint tracking RCU CPU kthread utilization

In the call to trace_rcu_utilization() at the start of the loop in
rcu_cpu_kthread(), "rcu_wait" is incorrect, plus this trace event needs
to be hoisted above the loop to balance with either the "rcu_wait" or
"rcu_yield", depending on how the loop exits. This commit therefore
makes these changes.

Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 5b14557b 26-Nov-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Avoid tick_dep_set_cpu() misordering

In the current code, rcu_nmi_enter_common() might decide to turn on
the tick using tick_dep_set_cpu(), but be delayed just before doing so.
Then the grace-period kthread might notice that the CPU in question had
in fact gone through a quiescent state, thus turning off the tick using
tick_dep_clear_cpu(). The later invocation of tick_dep_set_cpu() would
then incorrectly leave the tick on.

This commit therefore enlists the aid of the leaf rcu_node structure's
->lock to ensure that decisions to enable or disable the tick are
carried out before they can be reversed.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# c130d2dc 15-Oct-2019 Lai Jiangshan <jiangshanlai@gmail.com>

rcu: Rename some instance of CONFIG_PREEMPTION to CONFIG_PREEMPT_RCU

CONFIG_PREEMPTION and CONFIG_PREEMPT_RCU are always identical,
but some code depends on CONFIG_PREEMPTION to access to
rcu_preempt functionality. This patch changes CONFIG_PREEMPTION
to CONFIG_PREEMPT_RCU in these cases.

Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 189a6883 29-Aug-2019 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Remove kfree_call_rcu_nobatch()

Now that the kfree_rcu() special-casing has been removed from tree RCU,
this commit removes kfree_call_rcu_nobatch() since it is no longer needed.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 77a40f97 29-Aug-2019 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Remove kfree_rcu() special casing and lazy-callback handling

This commit removes kfree_rcu() special-casing and the lazy-callback
handling from Tree RCU. It moves some of this special casing to Tiny RCU,
the removal of which will be the subject of later commits.

This results in a nice negative delta.

Suggested-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Add slab.h #include, thanks to kbuild test robot <lkp@intel.com>. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# e99637be 22-Sep-2019 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Add support for debug_objects debugging for kfree_rcu()

This commit applies RCU's debug_objects debugging to the new batched
kfree_rcu() implementations. The object is queued at the kfree_rcu()
call and dequeued during reclaim.

Tested that enabling CONFIG_DEBUG_OBJECTS_RCU_HEAD successfully detects
double kfree_rcu() calls.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Fix IRQ per kbuild test robot <lkp@intel.com> feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 0392bebe 19-Sep-2019 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Add multiple in-flight batches of kfree_rcu() work

During testing, it was observed that amount of memory consumed due
kfree_rcu() batching is 300-400MB. Previously we had only a single
head_free pointer pointing to the list of rcu_head(s) that are to be
freed after a grace period. Until this list is drained, we cannot queue
any more objects on it since such objects may not be ready to be
reclaimed when the worker thread eventually gets to drainin g the
head_free list.

We can do better by maintaining multiple lists as done by this patch.
Testing shows that memory consumption came down by around 100-150MB with
just adding another list. Adding more than 1 additional list did not
show any improvement.

Suggested-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Code style and initialization handling. ]
[ paulmck: Fix field name, reported by kbuild test robot <lkp@intel.com>. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 569d7670 22-Sep-2019 Joel Fernandes <joel@joelfernandes.org>

rcu: Make kfree_rcu() use a non-atomic ->monitor_todo

Because the ->monitor_todo field is always protected by krcp->lock,
this commit downgrades from xchg() to non-atomic unmarked assignment
statements.

Signed-off-by: Joel Fernandes <joel@joelfernandes.org>
[ paulmck: Update to include early-boot kick code. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# a35d1690 05-Aug-2019 Byungchul Park <byungchul.park@lge.com>

rcu: Add basic support for kfree_rcu() batching

Recently a discussion about stability and performance of a system
involving a high rate of kfree_rcu() calls surfaced on the list [1]
which led to another discussion how to prepare for this situation.

This patch adds basic batching support for kfree_rcu(). It is "basic"
because we do none of the slab management, dynamic allocation, code
moving or any of the other things, some of which previous attempts did
[2]. These fancier improvements can be follow-up patches and there are
different ideas being discussed in those regards. This is an effort to
start simple, and build up from there. In the future, an extension to
use kfree_bulk and possibly per-slab batching could be done to further
improve performance due to cache-locality and slab-specific bulk free
optimizations. By using an array of pointers, the worker thread
processing the work would need to read lesser data since it does not
need to deal with large rcu_head(s) any longer.

Torture tests follow in the next patch and show improvements of around
5x reduction in number of grace periods on a 16 CPU system. More
details and test data are in that patch.

There is an implication with rcu_barrier() with this patch. Since the
kfree_rcu() calls can be batched, and may not be handed yet to the RCU
machinery in fact, the monitor may not have even run yet to do the
queue_rcu_work(), there seems no easy way of implementing rcu_barrier()
to wait for those kfree_rcu()s that are already made. So this means a
kfree_rcu() followed by an rcu_barrier() does not imply that memory will
be freed once rcu_barrier() returns.

Another implication is higher active memory usage (although not
run-away..) until the kfree_rcu() flooding ends, in comparison to
without batching. More details about this are in the second patch which
adds an rcuperf test.

Finally, in the near future we will get rid of kfree_rcu() special casing
within RCU such as in rcu_do_batch and switch everything to just
batching. Currently we don't do that since timer subsystem is not yet up
and we cannot schedule the kfree_rcu() monitor as the timer subsystem's
lock are not initialized. That would also mean getting rid of
kfree_call_rcu_nobatch() entirely.

[1] http://lore.kernel.org/lkml/20190723035725-mutt-send-email-mst@kernel.org
[2] https://lkml.org/lkml/2017/12/19/824

Cc: kernel-team@android.com
Cc: kernel-team@lge.com
Co-developed-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Applied 0day and Paul Walmsley feedback on ->monitor_todo. ]
[ paulmck: Make it work during early boot. ]
[ paulmck: Add a crude early boot self-test. ]
[ paulmck: Style adjustments and experimental docbook structure header. ]
Link: https://lore.kernel.org/lkml/alpine.DEB.2.21.9999.1908161931110.32497@viisi.sifive.com/T/#me9956f66cb611b95d26ae92700e1d901f46e8c59
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# c30fe541 11-Oct-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Mark non-global functions and variables as static

Each of rcu_state, rcu_rnp_online_cpus(), rcu_dynticks_curr_cpu_in_eqs(),
and rcu_dynticks_snap() are used only in the kernel/rcu/tree.o translation
unit, and may thus be marked static. This commit therefore makes this
change.

Reported-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>


# 90326f05 15-Oct-2019 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

rcu: Use CONFIG_PREEMPTION where appropriate

The config option `CONFIG_PREEMPT' is used for the preemption model
"Low-Latency Desktop". The config option `CONFIG_PREEMPTION' is enabled
when kernel preemption is enabled which is true for the preemption model
`CONFIG_PREEMPT' and `CONFIG_PREEMPT_RT'.

Use `CONFIG_PREEMPTION' if it applies to both preemption models and not
just to `CONFIG_PREEMPT'.

Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: rcu@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 6cf539a8 09-Oct-2019 Marco Elver <elver@google.com>

rcu: Fix data-race due to atomic_t copy-by-value

This fixes a data-race where `atomic_t dynticks` is copied by value. The
copy is performed non-atomically, resulting in a data-race if `dynticks`
is updated concurrently.

This data-race was found with KCSAN:
==================================================================
BUG: KCSAN: data-race in dyntick_save_progress_counter / rcu_irq_enter

write to 0xffff989dbdbe98e0 of 4 bytes by task 10 on cpu 3:
atomic_add_return include/asm-generic/atomic-instrumented.h:78 [inline]
rcu_dynticks_snap kernel/rcu/tree.c:310 [inline]
dyntick_save_progress_counter+0x43/0x1b0 kernel/rcu/tree.c:984
force_qs_rnp+0x183/0x200 kernel/rcu/tree.c:2286
rcu_gp_fqs kernel/rcu/tree.c:1601 [inline]
rcu_gp_fqs_loop+0x71/0x880 kernel/rcu/tree.c:1653
rcu_gp_kthread+0x22c/0x3b0 kernel/rcu/tree.c:1799
kthread+0x1b5/0x200 kernel/kthread.c:255
<snip>

read to 0xffff989dbdbe98e0 of 4 bytes by task 154 on cpu 7:
rcu_nmi_enter_common kernel/rcu/tree.c:828 [inline]
rcu_irq_enter+0xda/0x240 kernel/rcu/tree.c:870
irq_enter+0x5/0x50 kernel/softirq.c:347
<snip>

Reported by Kernel Concurrency Sanitizer on:
CPU: 7 PID: 154 Comm: kworker/7:1H Not tainted 5.3.0+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Workqueue: kblockd blk_mq_run_work_fn
==================================================================

Signed-off-by: Marco Elver <elver@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: rcu@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 05ef9e9e 15-Aug-2019 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Ensure that ->rcu_urgent_qs is set before resched IPI

The RCU-specific resched_cpu() function sends a resched IPI to the
specified CPU, which can be used to force the tick on for a given
nohz_full CPU. This is needed when this nohz_full CPU is looping in the
kernel while blocking the current grace period. However, for the tick
to actually be forced on in all cases, that CPU's rcu_data structure's
->rcu_urgent_qs flag must be set beforehand. This commit therefore
causes rcu_implicit_dynticks_qs() to set this flag prior to invoking
resched_cpu() on a holdout nohz_full CPU.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# dd7dafd1 14-Sep-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Make kernel-mode nohz_full CPUs invoke the RCU core processing

If a nohz_full CPU is idle or executing in userspace, it makes good sense
to keep it out of RCU core processing. After all, the RCU grace-period
kthread can see its quiescent states and all of its callbacks are
offloaded, so there is nothing for RCU core processing to do.

However, if a nohz_full CPU is executing in kernel space, the RCU
grace-period kthread cannot do anything for it, so such a CPU must report
its own quiescent states. This commit therefore makes nohz_full CPUs
skip RCU core processing only if the scheduler-clock interrupt caught
them in idle or in userspace.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# ed93dfc6 13-Sep-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Confine ->core_needs_qs accesses to the corresponding CPU

Commit 671a63517cf9 ("rcu: Avoid unnecessary softirq when system
is idle") fixed a bug that could result in an indefinite number of
unnecessary invocations of the RCU_SOFTIRQ handler at the trailing edge
of a scheduler-clock interrupt. However, the fix introduced off-CPU
stores to ->core_needs_qs. These writes did not conflict with the
on-CPU stores because the CPU's leaf rcu_node structure's ->lock was
held across all such stores. However, the loads from ->core_needs_qs
were not promoted to READ_ONCE() and, worse yet, the code loading from
->core_needs_qs was written assuming that it was only ever updated by
the corresponding CPU. So operation has been robust, but only by luck.
This situation is therefore an accident waiting to happen.

This commit therefore takes a different approach. Instead of clearing
->core_needs_qs from the grace-period kthread's force-quiescent-state
processing, it modifies the rcu_pending() function to suppress the
rcu_sched_clock_irq() function's call to invoke_rcu_core() if there is no
grace period in progress. This avoids the infinite needless RCU_SOFTIRQ
handlers while still keeping all accesses to ->core_needs_qs local to
the corresponding CPU.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 516e5ae0 05-Sep-2019 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Reset CPU hints when reporting a quiescent state

In some cases, tracing shows that need_heavy_qs is still set even though
urgent_qs was cleared upon reporting of a quiescent state. One such
case is when the softirq reports that a CPU has passed quiescent state.

Commit 671a63517cf9 ("rcu: Avoid unnecessary softirq when system is
idle") fixed a bug where core_needs_qs was not being cleared. In order
to avoid running into similar situations with the urgent-grace-period
flags, this commit causes rcu_disable_urgency_upon_qs(), previously
rcu_disable_tick_upon_qs(), to clear the urgency hints, ->rcu_urgent_qs
and ->rcu_need_heavy_qs. Note that it is possible for CPUs to go
offline with these urgency hints still set. This is handled because
rcu_disable_urgency_upon_qs() is also invoked during the online process.

Because these hints can be cleared both by the corresponding CPU and by
the grace-period kthread, this commit also adds a number of READ_ONCE()
and WRITE_ONCE() calls.

Tested overnight with rcutorture running for 60 minutes on all
configurations of RCU.

Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
[ paulmck: Clear urgency flags in rcu_disable_urgency_upon_qs(). ]
[ paulmck: Remove ->core_needs_qs from the set cleared at quiescent state. ]
[ paulmck: Make rcu_disable_urgency_upon_qs static per kbuild test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# b200a048 15-Aug-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Force nohz_full tick on upon irq enter instead of exit

There is interrupt-exit code that forces on the tick for nohz_full CPUs
failing to respond to the current grace period in a timely fashion.
However, this code must compare ->dynticks_nmi_nesting to the value 2
in the interrupt-exit fastpath. This commit therefore moves this code
to the interrupt-entry fastpath, where a lighter-weight comparison to
zero may be used.

Reported-by: Joel Fernandes <joel@joelfernandes.org>
[ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 66e4c33b 12-Aug-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Force tick on for nohz_full CPUs not reaching quiescent states

CPUs running for long time periods in the kernel in nohz_full mode
might leave the scheduling-clock interrupt disabled for then full
duration of their in-kernel execution. This can (among other things)
delay grace periods. This commit therefore forces the tick back on
for any nohz_full CPU that is failing to pass through a quiescent state
upon return from interrupt, which the resched_cpu() will induce.

Reported-by: Joel Fernandes <joel@joelfernandes.org>
[ paulmck: Clear ->rcu_forced_tick as reported by Joel Fernandes testing. ]
[ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 79ba7ff5 04-Aug-2019 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Emulate dyntick aspect of userspace nohz_full sojourn

During an actual call_rcu() flood, there would be frequent trips to
userspace (in-kernel call_rcu() floods must be otherwise housebroken).
Userspace execution on nohz_full CPUs implies an RCU dyntick idle/not-idle
transition pair, so this commit adds emulation of that pair.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 96926686 02-Aug-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Make CPU-hotplug removal operations enable tick

CPU-hotplug removal operations run the multi_cpu_stop() function, which
relies on the scheduler to gain control from whatever is running on the
various online CPUs, including any nohz_full CPUs running long loops in
kernel-mode code. Lack of the scheduler-clock interrupt on such CPUs
can delay multi_cpu_stop() for several minutes and can also result in
RCU CPU stall warnings. This commit therefore causes CPU-hotplug removal
operations to enable the scheduler-clock interrupt on all online CPUs.

[ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ]
[ paulmck: Apply simplifications suggested by Frederic Weisbecker. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 366237e7 10-Jul-2019 Paul E. McKenney <paulmck@kernel.org>

stop_machine: Provide RCU quiescent state in multi_cpu_stop()

When multi_cpu_stop() loops waiting for other tasks, it can trigger an RCU
CPU stall warning. This can be misleading because what is instead needed
is information on whatever task is blocking multi_cpu_stop(). This commit
therefore inserts an RCU quiescent state into the multi_cpu_stop()
function's waitloop.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# 6a949b7a 28-Jul-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Force on tick when invoking lots of callbacks

Callback invocation can run for a significant time period, and within
CONFIG_NO_HZ_FULL=y kernels, this period will be devoid of scheduler-clock
interrupts. In-kernel execution without such interrupts can cause all
manner of malfunction, with RCU CPU stall warnings being but one result.

This commit therefore forces scheduling-clock interrupts on whenever more
than a few RCU callbacks are invoked. Because offloaded callback invocation
can be preempted, this forcing is withdrawn on each context switch. This
in turn requires that the loop invoking RCU callbacks reiterate the forcing
periodically.

[ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ]
[ paulmck: Remove NO_HZ_FULL check per Frederic Weisbecker feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>


# cfcdef5e 24-Jul-2019 Eric Dumazet <edumazet@google.com>

rcu: Allow rcu_do_batch() to dynamically adjust batch sizes

Bimodal behavior of rcu_do_batch() is not really suited to Google
applications like gfe servers.

When a process with millions of sockets exits, closing all files
queues two rcu callbacks per socket.

This eventually reaches the point where RCU enters an emergency
mode, where rcu_do_batch() do not return until whole queue is flushed.

Each rcu callback lasts at least 70 nsec, so with millions of
elements, we easily spend more than 100 msec without rescheduling.

Goal of this patch is to avoid the infamous message like following
"need_resched set for > 51999388 ns (52 ticks) without schedule"

We dynamically adjust the number of elements we process, instead
of 10 / INFINITE choices, we use a floor of ~1 % of current entries.

If the number is above 1000, we switch to a time based limit of 3 msec
per batch, adjustable with /sys/module/rcutree/parameters/rcu_resched_ns

Signed-off-by: Eric Dumazet <edumazet@google.com>
[ paulmck: Forward-port and remove debug statements. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 23651d9b 10-Jul-2019 Paul E. McKenney <paulmck@kernel.org>

rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks()

The rcutree_migrate_callbacks() invokes rcu_advance_cbs() on both the
offlined CPU's ->cblist and that of the surviving CPU, then merges
them. However, after the merge, and of the offlined CPU's callbacks
that were not ready to be invoked will no longer be associated with a
grace-period number. This commit therefore invokes rcu_advance_cbs()
one more time on the merged ->cblist in order to assign a grace-period
number to these callbacks.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# d1b222c6 02-Jul-2019 Paul E. McKenney <paulmck@kernel.org>

rcu/nocb: Add bypass callback queueing

Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs
takes advantage of unrelated grace periods, thus reducing the memory
footprint in the face of floods of call_rcu() invocations. However,
the ->cblist field is a more-complex rcu_segcblist structure which must
be protected via locking. Even though there are only three entities
which can acquire this lock (the CPU invoking call_rcu(), the no-CBs
grace-period kthread, and the no-CBs callbacks kthread), the contention
on this lock is excessive under heavy stress.

This commit therefore greatly reduces contention by provisioning
an rcu_cblist structure field named ->nocb_bypass within the
rcu_data structure. Each no-CBs CPU is permitted only a limited
number of enqueues onto the ->cblist per jiffy, controlled by a new
nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to
about 16 enqueues per millisecond (16 * 1000 / HZ). When that limit is
exceeded, the CPU instead enqueues onto the new ->nocb_bypass.

The ->nocb_bypass is flushed into the ->cblist every jiffy or when
the number of callbacks on ->nocb_bypass exceeds qhimark, whichever
happens first. During call_rcu() floods, this flushing is carried out
by the CPU during the course of its call_rcu() invocations. However,
a CPU could simply stop invoking call_rcu() at any time. The no-CBs
grace-period kthread therefore carries out less-aggressive flushing
(every few jiffies or when the number of callbacks on ->nocb_bypass
exceeds (2 * qhimark), whichever comes first). This means that the
no-CBs grace-period kthread cannot be permitted to do unbounded waits
while there are callbacks on ->nocb_bypass. A ->nocb_bypass_timer is
used to provide the needed wakeups.

[ paulmck: Apply Coverity feedback reported by Colin Ian King. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 6608c3a0 01-Jun-2019 Paul E. McKenney <paulmck@kernel.org>

rcu/nocb: Reduce contention at no-CBs registry-time CB advancement

Currently, __call_rcu_nocb_wake() conditionally acquires the leaf rcu_node
structure's ->lock, and only afterwards does rcu_advance_cbs_nowake()
check to see if it is possible to advance callbacks without potentially
needing to awaken the grace-period kthread. Given that the no-awaken
check can be done locklessly, this commit reverses the order, so that
rcu_advance_cbs_nowake() is invoked without holding the leaf rcu_node
structure's ->lock and rcu_advance_cbs_nowake() checks the grace-period
state before conditionally acquiring that lock, thus reducing the number
of needless acquistions of the leaf rcu_node structure's ->lock.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 7f36ef82 28-May-2019 Paul E. McKenney <paulmck@kernel.org>

rcu/nocb: Avoid needless wakeups of no-CBs grace-period kthread

Currently, the code provides an extra wakeup for the no-CBs grace-period
kthread if one of its CPUs is generating excessive numbers of callbacks.
But satisfying though it is to wake something up when things are going
south, unless the thing being awakened can actually help solve the
problem, that extra wakeup does nothing but consume additional CPU time,
which is exactly what you don't want during a call_rcu() flood.

This commit therefore avoids doing anything if the corresponding
no-CBs callback kthread is going full tilt. Otherwise, if advancing
callbacks immediately might help and if the leaf rcu_node structure's
lock is immediately available, this commit invokes a new variant of
rcu_advance_cbs() that advances callbacks only if doing so won't require
awakening the grace-period kthread (not to be confused with any of the
no-CBs grace-period kthreads).

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 921bb5fa 21-May-2019 Paul E. McKenney <paulmck@kernel.org>

rcu/nocb: Use build-time no-CBs check in rcu_pending()

Currently, rcu_pending() invokes rcu_segcblist_is_offloaded() even
in CONFIG_RCU_NOCB_CPU=n kernels, which cannot possibly be offloaded.
Given that rcu_pending() is on a fastpath, it makes sense to check for
CONFIG_RCU_NOCB_CPU=y before invoking rcu_segcblist_is_offloaded().
This commit therefore makes this change.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# c1ab99d6 21-May-2019 Paul E. McKenney <paulmck@kernel.org>

rcu/nocb: Use build-time no-CBs check in rcu_core()

Currently, rcu_core() invokes rcu_segcblist_is_offloaded() each time it
needs to know whether the current CPU is a no-CBs CPU. Given that it is
not possible to change the no-CBs status of a CPU after boot, and given
that it is not possible to even have no-CBs CPUs in CONFIG_RCU_NOCB_CPU=n
kernels, this repeated runtime invocation wastes CPU. This commit
therefore created a const on-stack variable to allow this check to be
done only once per rcu_core() invocation.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# ec5ef87b 21-May-2019 Paul E. McKenney <paulmck@kernel.org>

rcu/nocb: Use build-time no-CBs check in rcu_do_batch()

Currently, rcu_do_batch() invokes rcu_segcblist_is_offloaded() each time
it needs to know whether the current CPU is a no-CBs CPU. Given that it
is not possible to change the no-CBs status of a CPU after boot, and given
that it is not possible to even have no-CBs CPUs in CONFIG_RCU_NOCB_CPU=n
kernels, this per-callback invocation wastes CPU. This commit therefore
created a const on-stack variable to allow this check to be done only
once per rcu_do_batch() invocation.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# c035280f 21-May-2019 Paul E. McKenney <paulmck@kernel.org>

rcu/nocb: Remove obsolete nocb_q_count and nocb_q_count_lazy fields

This commit removes the obsolete nocb_q_count and nocb_q_count_lazy
fields, also removing rcu_get_n_cbs_nocb_cpu(), adjusting
rcu_get_n_cbs_cpu(), and making rcutree_migrate_callbacks() once again
disable the ->cblist fields of offline CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 5d6742b3 15-May-2019 Paul E. McKenney <paulmck@kernel.org>

rcu/nocb: Use rcu_segcblist for no-CBs CPUs

Currently the RCU callbacks for no-CBs CPUs are queued on a series of
ad-hoc linked lists, which means that these callbacks cannot benefit
from "drive-by" grace periods, thus suffering needless delays prior
to invocation. In addition, the no-CBs grace-period kthreads first
wait for callbacks to appear and later wait for a new grace period,
which means that callbacks appearing during a grace-period wait can
be delayed. These delays increase memory footprint, and could even
result in an out-of-memory condition.

This commit therefore enqueues RCU callbacks from no-CBs CPUs on the
rcu_segcblist structure that is already used by non-no-CBs CPUs. It also
restructures the no-CBs grace-period kthread to be checking for incoming
callbacks while waiting for grace periods. Also, instead of waiting
for a new grace period, it waits for the closest grace period that will
cause some of the callbacks to be safe to invoke. All of these changes
reduce callback latency and thus the number of outstanding callbacks,
in turn reducing the probability of an out-of-memory condition.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# e83e73f5 14-May-2019 Paul E. McKenney <paulmck@kernel.org>

rcu/nocb: Leave ->cblist enabled for no-CBs CPUs

As a first step towards making no-CBs CPUs use the ->cblist, this commit
leaves the ->cblist enabled for these CPUs. The main reason to make
no-CBs CPUs use ->cblist is to take advantage of callback numbering,
which will reduce the effects of missed grace periods which in turn will
reduce forward-progress problems for no-CBs CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 85f69b32 16-Apr-2019 Paul E. McKenney <paulmck@kernel.org>

rcu/nocb: Check for deferred nocb wakeups before nohz_full early exit

In theory, a timer is used to defer wakeups of no-CBs grace-period
kthreads when the wakeup cannot be done safely directly from the
call_rcu(). In practice, the one-jiffy delay is not always consistent
with timely callback invocation under heavy call_rcu() loads. Therefore,
there are a number of checks for a pending deferred wakeup, including
from the scheduling-clock interrupt. Unfortunately, this check follows
the rcu_nohz_full_cpu() early exit, which renders it useless on such CPUs.

This commit therefore moves the check for the pending deferred no-CB
wakeup to precede the rcu_nohz_full_cpu() early exit.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# c00045be 16-Apr-2019 Paul E. McKenney <paulmck@kernel.org>

rcu/nocb: Make rcutree_migrate_callbacks() start at leaf rcu_node structure

Because rcutree_migrate_callbacks() is invoked infrequently and because
an exact snapshot of the grace-period state might save some callbacks a
second trip through a grace period, this function has used the root
rcu_node structure. However, this safe-second-trip optimization
happens only if rcutree_migrate_callbacks() races with grace-period
initialization, so it is not worth the added mental load. This commit
therefore makes rcutree_migrate_callbacks() start with the leaf rcu_node
structures, as is done elsewhere.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 750d7f6a 16-Apr-2019 Paul E. McKenney <paulmck@kernel.org>

rcu/nocb: Add checks for offloaded callback processing

This commit is a preparatory patch for offloaded callbacks using the
same ->cblist structure used by non-offloaded callbacks. It therefore
adds rcu_segcblist_is_offloaded() calls where they will be needed when
!rcu_segcblist_is_enabled() no longer flags the offloaded case. It also
adds checks in rcu_do_batch() to ensure that there are no missed checks:
Currently, it should not be possible for offloaded execution to reach
rcu_do_batch(), though this will change later in this series.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# ce5215c1 12-Apr-2019 Paul E. McKenney <paulmck@kernel.org>

rcu/nocb: Use separate flag to indicate offloaded ->cblist

RCU callback processing currently uses rcu_is_nocb_cpu() to determine
whether or not the current CPU's callbacks are to be offloaded.
This works, but it is not so good for cache locality. Plus use of
->cblist for offloaded callbacks will greatly increase the frequency
of these checks. This commit therefore adds a ->offloaded flag to the
rcu_segcblist structure to provide a more flexible and cache-friendly
means of checking for callback offloading.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 130d9c33 31-Jul-2019 Peter Zijlstra <peterz@infradead.org>

rcu/tree: Fix SCHED_FIFO params

A rather embarrasing mistake had us call sched_setscheduler() before
initializing the parameters passed to it.

Fixes: 1a763fd7c633 ("rcu/tree: Call setschedule() gp ktread to SCHED_FIFO outside of atomic region")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>


# 01b1d88b 26-Jul-2019 Thomas Gleixner <tglx@linutronix.de>

rcu: Use CONFIG_PREEMPTION

CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
functionality which today depends on CONFIG_PREEMPT.

Switch the conditionals in RCU to use CONFIG_PREEMPTION.

That's the first step towards RCU on RT. The further tweaks are work in
progress. This neither touches the selftest bits which need a closer look
by Paul.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20190726212124.210156346@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 1a763fd7 19-Jul-2019 Juri Lelli <juri.lelli@redhat.com>

rcu/tree: Call setschedule() gp ktread to SCHED_FIFO outside of atomic region

sched_setscheduler() needs to acquire cpuset_rwsem, but it is currently
called from an invalid (atomic) context by rcu_spawn_gp_kthread().

Fix that by simply moving sched_setscheduler_nocheck() call outside of
the atomic region, as it doesn't actually require to be guarded by
rcu_node lock.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-8-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# d5a9a8c3 10-Apr-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Set a maximum limit for back-to-back callback invocation

Currently, if a CPU has more than 10,000 callbacks pending, it will
increase rdp->blimit to LONG_MAX. If you are lucky, LONG_MAX is only
about two billion, but this is still a bit too many callbacks to invoke
back-to-back while otherwise ignoring the world.

This commit therefore sets a maximum limit of DEFAULT_MAX_RCU_BLIMIT,
which is set to 10,000, for rdp->blimit.

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# eddded80 26-Mar-2019 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Add checks for dynticks counters in rcu_is_cpu_rrupt_from_idle()

It would be good to combine the dynticks and dynticks_nesting counters
in order to simplify the code. Unfortunately, there are concerns
about usermode upcalls appearing to RCU as half of an interrupt, as
Byungchul learned [1]. The "half" in "half interrupt" is due to an
unpaired rcu_irq_enter(): Normally, each rcu_irq_enter() has a later
call to rcu_irq_exit().

Out of an abundance of caution, Paul added warnings [2] in the RCU
code which if not fired by 2021 will be interpreted as meaning that
this half-interrupt scenario cannot happen any more, thus permitting
simplification of this code.

In the meantime, this commit makes the following changes:

(1) Combining these two counters requires that rcu_rrupt_from_idle()
is invoked only from hard-interrupt contexts as discussed here [3].
This commit therefore adds the required lockdep_assert_in_irq()
to check this constraint.

(2) Furthermore, rcu_rrupt_from_idle() is not explicit about how it
is using the counters which can lead to weird future bugs. This
commit therefore adds comments indicating the meaning and use of
each counter.

(3) Lastly, this commit checks for counter underflows as another check
that half interrupts don't occur. (Previously, the function would
simply return true upon underflow.)

All these checks checks are NOOPs if PROVE_LOCKING (and thus PROVE_RCU)
are disabled.

[1] https://lore.kernel.org/patchwork/patch/952349/
[2] Commit e11ec65cc8d6 ("rcu: Add warning to detect half-interrupts")
[3] https://lore.kernel.org/lkml/20190312150514.GB249405@google.com/

Cc: byungchul.park@lge.com
Cc: kernel-team@android.com
Cc: rcu@vger.kernel.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 43e903ad 25-Mar-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Inline invoke_rcu_callbacks() into its sole remaining caller

This commit saves a few lines of code by inlining invoke_rcu_callbacks()
into its sole remaining caller.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 48d07c04 20-Mar-2019 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

rcu: Enable elimination of Tree-RCU softirq processing

Some workloads need to change kthread priority for RCU core processing
without affecting other softirq work. This commit therefore introduces
the rcutree.use_softirq kernel boot parameter, which moves the RCU core
work from softirq to a per-CPU SCHED_OTHER kthread named rcuc. Use of
SCHED_OTHER approach avoids the scalability problems that appeared
with the earlier attempt to move RCU core processing to from softirq
to kthreads. That said, kernels built with RCU_BOOST=y will run the
rcuc kthreads at the RCU-boosting priority.

Note that rcutree.use_softirq=0 must be specified to move RCU core
processing to the rcuc kthreads: rcutree.use_softirq=1 is the default.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[ paulmck: Adjust for invoke_rcu_callbacks() only ever being invoked
from RCU core processing, in contrast to softirq->rcuc transition
in old mainline RCU priority boosting. ]
[ paulmck: Avoid wakeups when scheduler might have invoked rcu_read_unlock()
while holding rq or pi locks, also possibly fixing a pre-existing latent
bug involving raise_softirq()-induced wakeups. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# d75f773c 25-Mar-2019 Sakari Ailus <sakari.ailus@linux.intel.com>

treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively

%pF and %pf are functionally equivalent to %pS and %ps conversion
specifiers. The former are deprecated, therefore switch the current users
to use the preferred variant.

The changes have been produced by the following command:

git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \
while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done

And verifying the result.

Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: sparclinux@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: xen-devel@lists.xenproject.org
Cc: linux-acpi@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: drbd-dev@lists.linbit.com
Cc: linux-block@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Cc: linux-pci@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: linux-btrfs@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: linux-mm@kvack.org
Cc: ceph-devel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Acked-by: David Sterba <dsterba@suse.com> (for btrfs)
Acked-by: Mike Rapoport <rppt@linux.ibm.com> (for mm/memblock.c)
Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci)
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>


# 4f5fbd78 26-Mar-2019 Yafang Shao <laoar.shao@gmail.com>

rcu: validate arguments for rcu tracepoints

When CONFIG_RCU_TRACE is not set, all these tracepoints are defined as
do-nothing macro.
We'd better make those inline functions that take proper arguments.

As RCU_TRACE() is defined as do-nothing marco as well when
CONFIG_RCU_TRACE is not set, so we can clean it up.

Link: http://lkml.kernel.org/r/1553602391-11926-4-git-send-email-laoar.shao@gmail.com

Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# b51bcbbf 15-Jan-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Move forward-progress checkers into tree_stall.h

This commit further consolidates stall-warning functionality by moving
forward-progress checkers into kernel/rcu/tree_stall.h, updating a
comment or two while in the area. More specifically, this commit moves
show_rcu_gp_kthreads(), rcu_check_gp_start_stall(), rcu_fwd_progress_check(),
sysrq_rcu, sysrq_show_rcu(), sysrq_rcudump_op, and rcu_sysrq_init() from
kernel/rcu/tree.c to kernel/rcu/tree_stall.h.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 7ac1907c 14-Jan-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Move irq-disabled stall-warning checking to tree_stall.h

The rcu_iw_handler() function's sole purpose in life is to indicate
whether a stalled CPU had interrupts disabled, so it belongs in
kernel/rcu/tree_stall.h. This commit therefore makes that move,
clarifying its header comment while in the area.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 32255d51 11-Jan-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Move RCU CPU stall-warning code out of tree.c

This commit completes the process of consolidating the code for RCU CPU
stall warnings for normal grace periods by moving the remaining such
code from kernel/rcu/tree.c to kernel/rcu/tree_stall.h.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 10462d6f 11-Jan-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Move RCU CPU stall-warning code out of update.c

The RCU CPU stall-warning code for normal grace periods is currently
scattered across three files, due to earlier Tiny RCU support for RCU
CPU stall warnings and for old Kconfig options that have long since
been retired. Given that it is hard for the lead RCU maintainer to
find relevant stall-warning code, it would be good to consolidate it.
This commit starts this process by moving stall-warning code from
kernel/rcu/update.c to a new kernel/rcu/tree_stall.h file.

Note that the definitions of rcu_cpu_stall_suppress and
rcu_cpu_stall_timeout must remain in kernel/rcu/update.h to provide
compatibility for kernel boot parameter lists.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 5d8a752e 19-Mar-2019 Zhouyi Zhou <zhouzhouyi@gmail.com>

rcu: Fix force_qs_rnp() header comment

Previously, threads blocked on offlining CPUS were migrated to the
root rcu_node structure, thus requiring RCU priority boosting on this
structure. However, since commit d19fb8d1f3f6 ("rcu: Don't migrate
blocked tasks even if all corresponding CPUs offline"), RCU does not
migrate blocked tasks. Consequently, RCU no longer does RCU priority
boosting on the root rcu_node structure as of commit 1be0085b515e ("rcu:
Don't initiate RCU priority boosting on root rcu_node").

This commit therefore brings comments for the force_qs_rnp() function's
header comment in line with this new no-root-boosting reality.

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
[ paulmck: Also remove obsolete comment on suppressing new grace periods. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 85f2b60c 11-Mar-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Update jiffies_to_sched_qs and adjust_jiffies_till_sched_qs() comments

This commit better documents the jiffies_to_sched_qs default-value
strategy used by adjust_jiffies_till_sched_qs()

Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 6973032a 11-Mar-2019 Neeraj Upadhyay <neeraju@codeaurora.org>

rcu: Default jiffies_to_sched_qs to jiffies_till_sched_qs

The current code only calls adjust_jiffies_till_sched_qs() if
jiffies_till_sched_qs is left at its default value, so when the
jiffies_till_sched_qs kernel-boot parameter actually is specified,
jiffies_to_sched_qs will be left with the value zero, which
will result in useless slowdowns of cond_resched(). This commit
therefore changes rcu_init_geometry() to unconditionally invoke
adjust_jiffies_till_sched_qs(), which ensures that jiffies_to_sched_qs
will be initialized in all cases, thus maintaining good cond_resched()
performance.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 0f58d2ac 08-Mar-2019 Neeraj Upadhyay <neeraju@codeaurora.org>

rcu: Fix self-wakeups for grace-period kthread

The current rcu_gp_kthread_wake() function uses in_interrupt()
and thus does a self-wakeup from all interrupt contexts, including
the pointless case where the GP kthread happens to be running with
bottom halves disabled, along with the impossible case where the GP
kthread is running within an NMI handler (you are not supposed to invoke
rcu_gp_kthread_wake() from within an NMI handler. This commit therefore
replaces the in_interrupt() with in_irq(), so that the self-wakeups
happen only from handlers for hardware interrupts and softirqs.
This also makes the code match the comment.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# b2eb85b4 02-Mar-2019 Akira Yokosawa <akiyks@gmail.com>

rcu: Move common code out of if-else block

As the result of recent addition of "rdp->core_needs_qs = false;" in
the "if" block, now both branches of the if-else have the same
assignment.

Factor it out and reduce line count.

Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>


# 3ffe3d1a 21-Feb-2019 Liu Song <fishland@aliyun.com>

rcu: Set rcutree.kthread_prio sysfs access to read-only

The rcutree.kthread_prio kernel-boot parameter is used to set the
priority for boost (rcub), per-CPU (rcuc), and grace-period (rcu_preempt
or rcu_sched) kthreads. It is also used by rcutorture to check whether
it is possible to meaningfully test RCU priority boosting. However,
all of these cases will either ignore or be confused by any post-boot
changes to rcutree.kthread_prio.

Note that the user really can change the priorities of all of these
kthreads using chrt, given sufficient privileges. Therefore, the
read-write nature of sysfs access to rcutree.kthread_prio is thus at
best an attractive nuisance.

This commit therefore changes sysfs access to rcutree.kthread_prio to
be read-only.

Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 671a6351 19-Jan-2019 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Avoid unnecessary softirq when system is idle

When there are no callbacks pending on an idle system, I noticed that
RCU softirq is continuously firing. During this the cpu_no_qs is set to
false, and core_needs_qs is set to true indefinitely. This causes
rcu_process_callbacks to be repeatedly called, even though the node
corresponding to the CPU has that CPU's mask bit cleared and the system
is idle. I believe the race is when such mask clearing is done during
idle CPU scan of the quiescent state forcing stage in the kthread
instead of the softirq. Since the rnp mask is cleared, but the flags on
the CPU's rdp are not cleared, the CPU thinks it still needs to report
to core RCU.

Cure this by clearing the core_needs_qs flag when the CPU detects that
its node is already updated which will avoid the unwanted softirq raises
to the benefit of real-time systems.

Test: Ran rcutorture for various tree RCU configs.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# e85e6a21 10-Jan-2019 Paul E. McKenney <paulmck@kernel.org>

rcu: Unconditionally expedite during suspend/hibernate

The rcu_pm_notify() function refuses to switch to/from expedited grace
periods on systems with more than 256 CPUs due to the serialized
initialization of expedited grace periods. However, expedited grace
periods are now initialized in parallel, removing this concern.
This commit therefore removes the checks from rcu_pm_notify(), so that
expedited grace periods are used unconditionally during suspend/resume
and hibernate/wake operations.

As always, real-time workloads wishing to completely avoid expedited
grace periods can use the rcupdate.rcu_normal= kernel parameter.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# c13324a5 12-Feb-2019 Masami Hiramatsu <mhiramat@kernel.org>

x86/kprobes: Prohibit probing on functions before kprobe_int3_handler()

Prohibit probing on the functions called before kprobe_int3_handler()
in do_int3(). More specifically, ftrace_int3_handler(),
poke_int3_handler(), and ist_enter(). And since rcu_nmi_enter() is
called by ist_enter(), it also should be marked as NOKPROBE_SYMBOL.

Since those are handled before kprobe_int3_handler(), probing those
functions can cause a breakpoint recursion and crash the kernel.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/154998793571.31052.11301258949601150994.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 22e40925 17-Jan-2019 Paul E. McKenney <paulmck@kernel.org>

rcu/tree: Convert to SPDX license identifier

Replace the license boiler plate with a SPDX license identifier.
While in the area, update an email address.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
[ paulmck: Update .h file SPDX comment format per Joe Perches. ]
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>


# e81baf4c 10-Dec-2018 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

srcu: Remove srcu_queue_delayed_work_on()

srcu_queue_delayed_work_on() disables preemption (and therefore CPU
hotplug in RCU's case) and then checks based on its own accounting if a
CPU is online. If the CPU is online it uses queue_delayed_work_on()
otherwise it fallbacks to queue_delayed_work().
The problem here is that queue_work() on -RT does not work with disabled
preemption.

queue_work_on() works also on an offlined CPU. queue_delayed_work_on()
has the problem that it is possible to program a timer on an offlined
CPU. This timer will fire once the CPU is online again. But until then,
the timer remains programmed and nothing will happen.

Add a local timer which will fire (as requested per delay) on the local
CPU and then enqueue the work on the specific CPU.

RCUtorture testing with SRCU-P for 24h showed no problems.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 39abefe7 26-Nov-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Repair rcu_nmi_exit() docbook header

This commit removes the "@irq" argument from the rcu_nmi_exit() docbook
header, given that this function now has no arguments.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# fb60e533 21-Nov-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Rename rcu_process_callbacks() to rcu_core() for Tree RCU

Although the name rcu_process_callbacks() still makes sense for Tiny
RCU, where most of what it does is invoke callbacks, it no longer makes
much sense for Tree RCU, especially given that the actually callback
invocation is relegated to rcu_do_batch(), or, for no-CBs CPUs, to the
rcuo kthreads. Especially in the latter case, rcu_process_callbacks()
has very little to do with actual callbacks. A better description of
this function is that it performs RCU's core processing.

This commit therefore changes the name of Tree RCU's rcu_process_callbacks()
function to rcu_core(), which also has the virtue of being consistent with
the existing invoke_rcu_core() function.

While in the area, the header comment is reworked.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# c98cac60 21-Nov-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Rename rcu_check_callbacks() to rcu_sched_clock_irq()

The name rcu_check_callbacks() arguably made sense back in the early
2000s when RCU was quite a bit simpler than it is today, but it has
become quite misleading, especially with the advent of dyntick-idle
and NO_HZ_FULL. The rcu_check_callbacks() function is RCU's hook into
the scheduling-clock interrupt, and is now but one of many ways that
callbacks get promoted to invocable state.

This commit therefore changes the name to rcu_sched_clock_irq(),
which is the same number of characters and clearly indicates this
function's relation to the rest of the Linux kernel. In addition, for
the sake of consistency, rcu_flavor_check_callbacks() is also renamed
to rcu_flavor_sched_clock_irq().

While in the area, the header comments for both functions are reworked.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 13dc7d0c 19-Dec-2018 Zhang, Jun <jun.zhang@intel.com>

rcu: Prevent needless ->gp_seq_needed update in __note_gp_changes()

Currently, __note_gp_changes() checks to see if the rcu_node structure's
->gp_seq_needed is greater than or equal to that of the rcu_data
structure, and if so, updates the rcu_data structure's ->gp_seq_needed
field. This results in a useless store in the case where the two fields
are equal.

This commit therefore carries out this store only in the case where the
rcu_node structure's ->gp_seq_needed is strictly greater than that of
the rcu_data structure.

Signed-off-by: "Zhang, Jun" <jun.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Link: https://lkml.kernel.org/r/88DC34334CA3444C85D647DBFA962C2735AD5F77@SHSMSX104.ccr.corp.intel.com


# 1d1f898d 18-Dec-2018 Zhang, Jun <jun.zhang@intel.com>

rcu: Do RCU GP kthread self-wakeup from softirq and interrupt

The rcu_gp_kthread_wake() function is invoked when it might be necessary
to wake the RCU grace-period kthread. Because self-wakeups are normally
a useless waste of CPU cycles, if rcu_gp_kthread_wake() is invoked from
this kthread, it naturally refuses to do the wakeup.

Unfortunately, natural though it might be, this heuristic fails when
rcu_gp_kthread_wake() is invoked from an interrupt or softirq handler
that interrupted the grace-period kthread just after the final check of
the wait-event condition but just before the schedule() call. In this
case, a wakeup is required, even though the call to rcu_gp_kthread_wake()
is within the RCU grace-period kthread's context. Failing to provide
this wakeup can result in grace periods failing to start, which in turn
results in out-of-memory conditions.

This race window is quite narrow, but it actually did happen during real
testing. It would of course need to be fixed even if it was strictly
theoretical in nature.

This patch does not Cc stable because it does not apply cleanly to
earlier kernel versions.

Fixes: 48a7639ce80c ("rcu: Make callers awaken grace-period kthread")
Reported-by: "He, Bo" <bo.he@intel.com>
Co-developed-by: "Zhang, Jun" <jun.zhang@intel.com>
Co-developed-by: "He, Bo" <bo.he@intel.com>
Co-developed-by: "xiao, jin" <jin.xiao@intel.com>
Co-developed-by: Bai, Jie A <jie.a.bai@intel.com>
Signed-off: "Zhang, Jun" <jun.zhang@intel.com>
Signed-off: "He, Bo" <bo.he@intel.com>
Signed-off: "xiao, jin" <jin.xiao@intel.com>
Signed-off: Bai, Jie A <jie.a.bai@intel.com>
Signed-off-by: "Zhang, Jun" <jun.zhang@intel.com>
[ paulmck: Switch from !in_softirq() to "!in_interrupt() &&
!in_serving_softirq() to avoid redundant wakeups and to also handle the
interrupt-handler scenario as well as the softirq-handler scenario that
actually occurred in testing. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Link: https://lkml.kernel.org/r/CD6925E8781EFD4D8E11882D20FC406D52A11F61@SHSMSX104.ccr.corp.intel.com


# 2ccaff10 12-Dec-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Add sysrq rcu_node-dump capability

Life is hard if RCU manages to get stuck without triggering RCU CPU
stall warnings or triggering the rcu_check_gp_start_stall() checks
for failing to start a grace period. This commit therefore adds a
boot-time-selectable sysrq key (commandeering "y") that allows manually
dumping Tree RCU state. The new rcutree.sysrq_rcu kernel boot parameter
must be set for this sysrq to be available.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 3b6505fd 12-Dec-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Protect rcu_check_gp_kthread_starvation() access to ->gp_flags

The rcu_check_gp_kthread_starvation() function can be invoked without
holding locks, so the access to the rcu_state structure's ->gp_flags
field must be protected with READ_ONCE(). This commit therefore adds
this protection.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# fd897573 10-Dec-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Improve diagnostics for failed RCU grace-period start

If a grace period fails to start (for example, because you commented
out the last two lines of rcu_accelerate_cbs_unlocked()), rcu_core()
will invoke rcu_check_gp_start_stall(), which will notice and complain.
However, this complaint is lacking crucial debugging information such
as when the last wakeup executed and what the value of ->gp_seq was at
that time. This commit therefore removes the current pr_alert() from
rcu_check_gp_start_stall(), instead invoking show_rcu_gp_kthreads(),
which has been updated to print the needed information, which is collected
by rcu_gp_kthread_wake().

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 9cf422a8 20-Nov-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Accommodate zero jiffies_till_first_fqs and kthread kicking

It is perfectly fine to set the rcutree.jiffies_till_first_fqs boot
parameter to zero, in fact, this can be useful on specialty systems that
usually have at least one idle CPU and that need fast grace periods.
This is because this setting causes the RCU grace-period kthread to
scan for idle threads immediately after grace-period initialization,
as opposed to waiting several jiffies to do so.

It is also perfectly fine to set the rcutree.rcu_kick_kthreads kernel
parameter, which gives the RCU grace-period kthread an extra wakeup
if it doesn't make progress for a period of three times the setting of
the rcutree.jiffies_till_first_fqs boot parameter. This is of course
problematic when the value of this parameter is zero, as it can result
in unnecessary wakeup IPIs along with unnecessary WARN_ONCE() invocations.

This commit therefore defers kthread kicking for at least two jiffies,
regardless of the setting of rcutree.jiffies_till_first_fqs.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 260e1e4f 29-Nov-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Discard separate per-CPU callback counts

Back when there were multiple flavors of RCU, it was necessary to
separately count lazy and non-lazy callbacks for each CPU. These counts
were used in CONFIG_RCU_FAST_NO_HZ kernels to determine how long a newly
idle CPU should be allowed to sleep before handling its RCU callbacks.
But now that there is only one flavor, the callback counts for a given
CPU's sole rcu_data structure are the counts for that CPU.

This commit therefore removes the rcu_data structure's ->nonlazy_posted
and ->nonlazy_posted_snap fields, the rcu_idle_count_callbacks_posted()
and rcu_cpu_has_callbacks() functions, repurposes the rcu_data structure's
->all_lazy field to record the laziness state at the beginning of the
latest idle sojourn, and modifies CONFIG_RCU_FAST_NO_HZ RCU CPU stall
warnings accordingly.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# e5bc3af7 29-Nov-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Consolidate PREEMPT and !PREEMPT synchronize_rcu()

Now that rcu_blocking_is_gp() makes the correct immediate-return
decision for both PREEMPT and !PREEMPT, a single implementation of
synchronize_rcu() will work correctly under both configurations.
This commit therefore eliminates a few lines of code by consolidating
the two implementations of synchronize_rcu().

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# c97058d0 28-Nov-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate RCU_BH_FLAVOR and RCU_SCHED_FLAVOR

Now that the RCU flavors have been consolidated, RCU_BH_FLAVOR and
RCU_SCHED_FLAVOR are no longer used. This commit therefore saves a
few lines by removing them.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# cd920e5a 28-Nov-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Inline force_quiescent_state() into rcu_force_quiescent_state()

Given that rcu_force_quiescent_state() is a simple wrapper around
force_quiescent_state(), this commit saves a few lines of code by
inlining force_quiescent_state() into rcu_force_quiescent_state(),
and changing all references to force_quiescent_state() to instead
invoke rcu_force_quiescent_state().

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# ad368d15 27-Nov-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Rename and comment changes due to only one rcuo kthread per CPU

Given RCU flavor consolidation, the name rcu_spawn_all_nocb_kthreads()
is quite misleading. It no longer ever creates more than one kthread,
and it does so only for the specified CPU. This commit therefore changes
this name to the more descriptive rcu_spawn_cpu_nocb_kthread(), and also
fixes up a similar issue in its header comment while in the area.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# c51d7b5e 03-Oct-2018 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Print time since GP end upon forward-progress failure

If rcutorture's forward-progress tests fail while a grace period is not
in progress, it is useful to print the time since the last grace period
ended as a way to detect failure to launch a new grace period. This
commit therefore makes this change.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 8dd3b546 03-Oct-2018 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Print GP age upon forward-progress failure

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# bfcfcffc 02-Oct-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Print per-CPU callback counts for forward-progress failures

This commit prints out the non-zero per-CPU callback counts when a
forware-progress error (OOM event) occurs.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
[ paulmck: Fix a pair of uninitialized locals spotted by kbuild test robot. ]


# 903ee83d 02-Oct-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings

The RCU CPU stall warnings print an estimate of the total number of
RCU callbacks queued in the system, but this estimate leaves out
the callbacks queued for nocbs CPUs. This commit therefore introduces
rcu_get_n_cbs_cpu(), which gives an accurate callback estimate for
both nocbs and normal CPUs, and uses this new function as needed.

This commit also introduces a rcu_get_n_cbs_nocb_cpu() helper function
that returns the number of callbacks for nocbs CPUs or zero otherwise,
and also uses this function in place of direct access to ->nocb_q_count
while in the area (fewer characters, you see).

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# e0aff973 01-Oct-2018 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Dump grace-period diagnostics upon forward-progress OOM

This commit adds an OOM notifier during rcutorture forward-progress
testing. If this notifier is invoked, it dumps out some grace-period
state to help debug the forward-progress problem.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 0a89e5a4 15-Oct-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Trace end of grace period before end of grace period

Currently, rcu_gp_cleanup() traces the end of the old grace period after
the old grace period has officially ended. This might make intuitive
sense, but it also makes for confusing event-trace output because the
"end" trace displays not the old but instead the new grace-period number.
This commit therefore traces the end of an old grace period just before
that grace period officially ends.

Reported-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 2320bda2 08-Oct-2018 Zhouyi Zhou <zhouzhouyi@gmail.com>

rcu: Adjust the comment of function rcu_is_watching

Because RCU avoids interrupting idle CPUs, rcu_is_watching() is used to
test whether or not it is currently legal to run RCU read-side critical
sections on this CPU. However, the first sentence and last sentences
of current comment for rcu_is_watching have opposite meaning of what
is expected. This commit therefore fixes this header comment.

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# c669c014 02-Oct-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Add jiffies-since-GP-activity to show_rcu_gp_kthreads()

This commit adds a printout of the number of jiffies since the last time
that the RCU grace-period kthread did any processing. This can be useful
when tracking down forward-progress issues.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 69196019 02-Oct-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Add state name to show_rcu_gp_kthreads() output

This commit adds the name of the RCU grace-period state to
the show_rcu_gp_kthreads() output in order to ease debugging.
This commit also moves gp_state_getname() up in the code so that
show_rcu_gp_kthreads() can use it.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 791416c4 01-Oct-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Parameterize rcu_check_gp_start_stall()

In order to debug forward-progress stalls, it is necessary to check
for excessively delayed grace-period starts. This is currently done
for RCU CPU stall warnings by rcu_check_gp_start_stall(), which checks
to see if the start of a requested grace period has been delayed by an
RCU CPU stall warning period. Because rcutorture will need to check
for the time consumed by an RCU forward-progress delay, this commit
promotes gpssdelay from a local variable to a formal parameter. It is
not necessary to export rcu_check_gp_start_stall() because rcutorture
will access it via a wrapper function.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# b3c1d9ec 01-Oct-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Avoid double multiply by HZ

The rcu_check_gp_start_stall() function multiplies the return value
from rcu_jiffies_till_stall_check() by HZ, but the units are already
in jiffies. This commit therefore avoids the need for introduction of
a jiffies-squared unit by removing the extraneous multiplication.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# 08543bda 22-Oct-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate BUG_ON() for kernel/rcu/tree.c

The tree.c file has a number of calls to BUG_ON(), which panics the
kernel, which is not a good strategy for devices (like embedded) that
don't have a way to capture console output. This commit therefore
converts these BUG_ON() calls to WARN_ON_ONCE() and WARN_ONCE().

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>


# e0fcba9a 14-Aug-2018 Paul E. McKenney <paulmck@kernel.org>

srcu: Make call_srcu() available during very early boot

Event tracing is moving to SRCU in order to take advantage of the fact
that SRCU may be safely used from idle and even offline CPUs. However,
event tracing can invoke call_srcu() very early in the boot process,
even before workqueue_init_early() is invoked (let alone rcu_init()).
Therefore, call_srcu()'s attempts to queue work fail miserably.

This commit therefore detects this situation, and refrains from attempting
to queue work before rcu_init() time, but does everything else that it
would have done, and in addition, adds the srcu_struct to a global list.
The rcu_init() function now invokes a new srcu_init() function, which
is empty if CONFIG_SRCU=n. Otherwise, srcu_init() queues work for
each srcu_struct on the list. This all happens early enough in boot
that there is but a single CPU with interrupts disabled, which allows
synchronization to be dispensed with.

Of course, the queued work won't actually be invoked until after
workqueue_init() is invoked, which happens shortly after the scheduler
is up and running. This means that although call_srcu() may be invoked
any time after per-CPU variables have been set up, there is still a very
narrow window when synchronize_srcu() won't work, and this window
extends from the time that the scheduler starts until the time that
workqueue_init() returns. This can be fixed in a manner similar to
the fix for synchronize_rcu_expedited() and friends, but until someone
actually needs to use synchronize_srcu() during this window, this fix
is added churn for no benefit.

Finally, note that Tree SRCU's new srcu_init() function invokes
queue_work() rather than the queue_delayed_work() function that is
invoked post-boot. The reason is that queue_delayed_work() will (as you
would expect) post a timer, and timers have not yet been initialized.
So use of queue_work() avoids the complaints about use of uninitialized
spinlocks that would otherwise result. Besides, some delay is already
provide by the aforementioned fact that the queued work won't actually
be invoked until after the scheduler is up and running.

Requested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 894d45bb 15-Aug-2018 Mike Galbraith <efault@gmx.de>

rcu: Convert rcu_state.ofl_lock to raw_spinlock_t

1e64b15a4b10 ("rcu: Fix grace-period hangs due to race with CPU offline")
added spinlock_t ofl_lock to the rcu_state structure, then takes it with
preemption disabled during CPU offline, which gives the -rt patchset's
sleeping spinlock heartburn.

This commit therefore converts ->ofl_lock to raw_spinlock_t.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>


# 8d8a9d0e 04-Aug-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove obsolete ->dynticks_fqs and ->cond_resched_completed

The rcu_data structure's ->dynticks_fqs is incremented but never
accesses. Its ->cond_resched_completed field isn't used at all.
This commit therefore removes both fields.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# dc5a4f29 03-Aug-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Switch ->dynticks to rcu_data structure, remove rcu_dynticks

This commit move ->dynticks from the rcu_dynticks structure to the
rcu_data structure, replacing the field of the same name. It also updates
the code to access ->dynticks from the rcu_data structure and to use the
rcu_data structure rather than following to now-gone ->dynticks field
to the now-gone rcu_dynticks structure. While in the area, this commit
also fixes up comments.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 4c5273bf 03-Aug-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Switch dyntick nesting counters to rcu_data structure

This commit removes ->dynticks_nesting and ->dynticks_nmi_nesting from
the rcu_dynticks structure and updates the code to access them from the
rcu_data structure.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 2dba13f0 03-Aug-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Switch urgent quiescent-state requests to rcu_data structure

This commit removes ->rcu_need_heavy_qs and ->rcu_urgent_qs from the
rcu_dynticks structure and updates the code to access them from the
rcu_data structure.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# df63fa5b 31-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Convert "1UL << x" to "BIT(x)"

This commit saves a few characters by converting "1UL << x" to "BIT(x)".

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# fced9c8c 26-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Avoid resched_cpu() when rescheduling the current CPU

The resched_cpu() interface is quite handy, but it does acquire the
specified CPU's runqueue lock, which does not come for free. This
commit therefore substitutes the following when directing resched_cpu()
at the current CPU:

set_tsk_need_resched(current);
set_preempt_need_resched();

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>


# d3052109 25-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: More aggressively enlist scheduler aid for nohz_full CPUs

Because nohz_full CPUs can leave the scheduler-clock interrupt disabled
even when in kernel mode, RCU cannot rely on rcu_check_callbacks() to
enlist the scheduler's aid in extracting a quiescent state from such CPUs.
This commit therefore more aggressively uses resched_cpu() on nohz_full
CPUs that fail to pass through a quiescent state in a timely manner.
By default, the resched_cpu() beating starts 300 milliseconds into the
quiescent state.

While in the neighborhood, add a ->last_fqs_resched field to the rcu_data
structure in order to rate-limit resched_cpu() calls from the RCU
grace-period kthread.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# c06aed0e 25-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Compute jiffies_till_sched_qs from other kernel parameters

The jiffies_till_sched_qs value used to determine how old a grace period
must be before RCU enlists the help of the scheduler to force a quiescent
state on the holdout CPU. Currently, this defaults to HZ/10 regardless of
system size and may be set only at boot time. This can be a problem for
very large systems, because if the values of the jiffies_till_first_fqs
and jiffies_till_next_fqs kernel parameters are left at their defaults,
they are calculated to increase as the number of CPUs actually configured
on the system increases. Thus, on a sufficiently large system, RCU would
enlist the help of the scheduler before the grace-period kthread had a
chance to scan for idle CPUs, which wastes CPU time.

This commit therefore allows jiffies_till_sched_qs to be set, if desired,
but if left as default, computes is as jiffies_till_first_fqs plus twice
jiffies_till_next_fqs, thus allowing three force-quiescent-state scans
for idle CPUs. This scales with the number of CPUs, providing sensible
default values.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 7e28c5af 11-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate ->rcu_qs_ctr from the rcu_dynticks structure

The ->rcu_qs_ctr counter was intended to allow providing a lightweight
report of a quiescent state to all RCU flavors. But now that there is
only one flavor of RCU in any one running kernel, there is no point in
having this feature. This commit therefore removes the ->rcu_qs_ctr
field from the rcu_dynticks structure and the ->rcu_qs_ctr_snap field
from the rcu_data structure. This results in the "rqc" option to the
rcu_fqs trace event no longer being used, so this commit also removes the
"rqc" description from the header comment.

While in the neighborhood, this commit also causes the forward-progress
request .rcu_need_heavy_qs be set one jiffies_till_sched_qs interval
later in the grace period than the first setting of .rcu_urgent_qs.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a0ef9ec2 09-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Provide improved interrupt-from-idle check in rcu_check_callbacks()

The patch making need_resched() respond to urgent RCU-QS needs used
is_idle_task(current) to detect an interrupt from idle, which does work
reasonably, but is (in theory at least) vulnerable to loops containing
need_resched() invoked from within RCU_NONIDLE() or its tracepoint
equivalent. This commit therefore moves rcu_is_cpu_rrupt_from_idle()
to a place from which rcu_check_callbacks() can invoke it and replaces
the is_idle_task(current) with rcu_is_cpu_rrupt_from_idle().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 92aa39e9 09-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make need_resched() respond to urgent RCU-QS needs

The per-CPU rcu_dynticks.rcu_urgent_qs variable communicates an urgent
need for an RCU quiescent state from the force-quiescent-state processing
within the grace-period kthread to context switches and to cond_resched().
Unfortunately, such urgent needs are not communicated to need_resched(),
which is sometimes used to decide when to invoke cond_resched(), for
but one example, within the KVM vcpu_run() function. As of v4.15, this
can result in synchronize_sched() being delayed by up to ten seconds,
which can be problematic, to say nothing of annoying.

This commit therefore checks rcu_dynticks.rcu_urgent_qs from within
rcu_check_callbacks(), which is invoked from the scheduling-clock
interrupt handler. If the current task is not an idle task and is
not executing in usermode, a context switch is forced, and either way,
the rcu_dynticks.rcu_urgent_qs variable is set to false. If the current
task is an idle task, then RCU's dyntick-idle code will detect the
quiescent state, so no further action is required. Similarly, if the
task is executing in usermode, other code in rcu_check_callbacks() and
its called functions will report the corresponding quiescent state.

Reported-by: Marius Hillenbrand <mhillenb@amazon.de>
Reported-by: David Woodhouse <dwmw2@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# dd46a788 10-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Inline _rcu_barrier() into its sole remaining caller

Because rcu_barrier() is a one-line wrapper function for _rcu_barrier()
and because nothing else calls _rcu_barrier(), this commit inlines
_rcu_barrier() into rcu_barrier().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 395a2f09 10-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Define rcu_all_qs() only in !PREEMPT builds

Now that rcu_all_qs() is used only in !PREEMPT builds, move it to
tree_plugin.h so that it is defined only in those builds. This in
turn means that rcu_momentary_dyntick_idle() is only used in !PREEMPT
builds, but it is simply marked __maybe_unused in order to keep it
near the rest of the dyntick-idle code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 49918a54 07-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Clean up flavor-related definitions and comments in tree.c

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# de3875d3 07-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove now-unused rcutorture APIs

This commit removes rcu_sched_get_gp_seq(), rcu_bh_get_gp_seq(),
rcu_exp_batches_completed_sched(), rcu_sched_force_quiescent_state(),
and rcu_bh_force_quiescent_state(), which are no longer used because
rcutorture no longer does "rcu_bh" and "rcu_sched" torture types.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a8bb74ac 06-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Consolidate RCU-sched update-side function definitions

This commit saves a few lines by consolidating the RCU-sched function
definitions at the end of include/linux/rcupdate.h. This consolidation
also makes it easier to remove them all when the time comes.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 4c7e9c14 06-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Consolidate RCU-bh update-side function definitions

This commit saves a few lines by consolidating the RCU-bh function
definitions at the end of include/linux/rcupdate.h. This consolidation
also makes it easier to remove them all when the time comes.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# c3854a05 05-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Pull rcu_gp_kthread() FQS loop into separate function

The rcu_gp_kthread() function is long and deeply indented, so this
commit pulls the loop that repeatedly invokes rcu_gp_fqs() into a new
rcu_gp_fqs_loop() function.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 4e95020c 05-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Inline increment_cpu_stall_ticks() into its sole caller

Consolidation of the RCU flavors into one makes increment_cpu_stall_ticks()
a trivial one-line function with only one caller. This commit therefore
inlines it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 8ff0b907 05-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix typo in force_qs_rnp()'s parameter's parameter

Pointers to rcu_data structures should be named rdp, not rsp. This
commit therefore makes this change.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# eb7a6653 05-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate initialization-time use of rsp

Now that there is only one rcu_state structure, there is less point in
maintaining a pointer to it. This commit therefore replaces rsp with
&rcu_state in rcu_cpu_starting() and rcu_init_one().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# ec9f5835 05-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate RCU-barrier use of rsp

Now that there is only one rcu_state structure, there is less point
in maintaining a pointer to it. This commit therefore replaces rsp
with &rcu_state in rcu_barrier_callback(), rcu_barrier_func(), and
_rcu_barrier().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 67a0edbf 05-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate quiescent-state and grace-period-nonstart use of rsp

Now that there is only one rcu_state structure, there is less point in
maintaining a pointer to it. This commit therefore replaces rsp with
&rcu_state in rcu_report_qs_rnp(), force_quiescent_state(), and
rcu_check_gp_start_stall().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 3c779dfe 05-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate callback-invocation/invocation use of rsp

Now that there is only one rcu_state structure, there is less point in
maintaining a pointer to it. This commit therefore replaces rsp with
&rcu_state in rcu_do_batch(), invoke_rcu_callbacks(), and __call_rcu().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 9cbc5b97 05-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate grace-period management code use of rsp

Now that there is only one rcu_state structure, there is less point
in maintaining a pointer to it. This commit therefore replaces
rsp with &rcu_state in rcu_start_this_gp(), rcu_accelerate_cbs(),
__note_gp_changes(), rcu_gp_init(), rcu_gp_fqs(), rcu_gp_cleanup(),
rcu_gp_kthread(), and rcu_report_qs_rsp().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 4c6ed437 05-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate stall-warning use of rsp

Now that there is only one rcu_state structure, there is less point
in maintaining a pointer to it. This commit therefore replaces rsp
with &rcu_state in print_other_cpu_stall(), print_cpu_stall(), and
check_cpu_stall().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 7cba4775 04-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Restructure rcu_check_gp_kthread_starvation()

This commit removes the rsp and gpa local variables, repurposes the j
local variable and adds a gpk (GP kthread) local to improve readability.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# f7dd7d44 04-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Simplify rcutorture_get_gp_data()

This commit restructures rcutorture_get_gp_data() to take advantage of
the fact that there is only one flavor of RCU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b97d23c5 04-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove for_each_rcu_flavor() flavor-traversal macro

Now that there is only ever a single flavor of RCU in a given kernel
build, there isn't a whole lot of point in having a flavor-traversal
macro. This commit therefore removes it and converts calls to it to
straightline code, inlining trivial functions as appropriate.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 88d1bead 04-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rcu_data structure's ->rsp field

Now that there is only one rcu_state structure, there is no need for the
rcu_data structure to indicate which it corresponds to. This commit
therefore removes the rcu_data structure's ->rsp field, replacing all
remaining uses of it with &rcu_state.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# aedf4ba9 04-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_node tree accessor macros

There now is only one rcu_state structure in a given build of the Linux
kernel, so there is no need to pass it as a parameter to RCU's rcu_node
tree's accessor macros. This commit therefore removes the rsp parameter
from those macros in kernel/rcu/rcu.h, and removes some now-unused rsp
local variables while in the area.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 63d4c8c9 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from expedited grace-period functions

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to
RCU's functions. This commit therefore removes the rsp parameter
from the code in kernel/rcu/tree_exp.h, and removes all of the
rsp local variables while in the area.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 4580b054 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from no-CBs CPU functions

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to
RCU's functions. This commit therefore removes the rsp parameter
from rcu_nocb_cpu_needs_barrier(), rcu_spawn_one_nocb_kthread(),
rcu_organize_nocb_kthreads(), rcu_nocb_cpu_needs_barrier(), and
rcu_nohz_full_cpu().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b21ebed9 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from print_cpu_stall_info()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
print_cpu_stall_info().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 81ab59a3 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from dump_blkd_tasks() and friend

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
dump_blkd_tasks() and rcu_preempt_blocked_readers_cgp().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a2887cd8 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_print_detail_task_stall()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_print_detail_task_stall().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b8bb1f63 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_init_one() and friends

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_init_one() and rcu_dump_rcu_node_tree().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 53b46303 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_boot_init_percpu_data() and friends

There now is only one rcu_state structure in a given build of
the Linux kernel, so there is no need to pass it as a parameter
to RCU's functions. This commit therefore removes the rsp
parameter from rcu_boot_init_percpu_data(), rcu_init_percpu_data(),
rcu_cleanup_dying_idle_cpu(), and rcu_migrate_callbacks(). While in
the neighborhood, line the last three into rcutree_prepare_cpu(),
rcu_report_dead() and rcutree_migrate_callbacks(), respectively.
This also gets rid of the for_each_rcu_flavor() calls that were in
those tree functions.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 8344b871 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from _rcu_barrier() and friends

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
_rcu_barrier_trace() and _rcu_barrier().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 98ece508 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from __rcu_pending()

There now is only one rcu_state structure in a given build of the Linux
kernel, so there is no need to pass it as a parameter to RCU's functions.
This commit therefore removes the rsp parameter from __rcu_pending(),
and also inlines it into rcu_pending(), removing the for_each_rcu_flavor()
while in the neighborhood..

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 5c7d8967 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from __call_rcu() and friend

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
__call_rcu_core() and __call_rcu().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b049fdf8 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from __rcu_process_callbacks()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
__rcu_process_callbacks(), and also inlines it into rcu_process_callbacks(),
removing the for_each_rcu_flavor() while in the neighborhood.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b96f9dc4 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_check_gp_start_stall()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_check_gp_start_stall().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# e9ecb780 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from force-quiescent-state functions

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
force_qs_rnp() and force_quiescent_state().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 5bb5d09c 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_do_batch()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_do_batch().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 780cd590 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from CPU hotplug functions

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_cleanup_dying_cpu() and rcu_cleanup_dead_cpu(). And, as long as
we are in the neighborhood, inlines them into rcutree_dying_cpu() and
rcutree_dead_cpu(), respectively. This also eliminates a pair of
for_each_rcu_flavor() loops.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 8087d3e3 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_check_quiescent_state()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_check_quiescent_state().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 0854a05c 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_gp_kthread() and friends

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_gp_init(), rcu_gp_fqs_check_wake(), rcu_gp_fqs(), rcu_gp_cleanup(),
and rcu_gp_kthread().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 22212332 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_gp_slow()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_gp_slow().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 15cabdff 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from note_gp_changes()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
note_gp_changes().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# c7e48f7b 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from __note_gp_changes()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
__note_gp_changes().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 834f56bf 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_advance_cbs()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_advance_cbs().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# c6e09b97 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_accelerate_cbs_unlocked()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_accelerate_cbs_unlocked().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 02f50142 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_accelerate_cbs()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_accelerate_cbs().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 532c00c9 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_gp_kthread_wake()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_gp_kthread_wake().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 3481f2ea 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_future_gp_cleanup()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_future_gp_cleanup().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# ea12ff2b 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from check_cpu_stall()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
check_cpu_stall().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 4e8b8e08 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from print_cpu_stall()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
print_cpu_stall().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a91e7e58 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from print_other_cpu_stall()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
print_other_cpu_stall().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# e1741c69 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_stall_kick_kthreads()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_stall_kick_kthreads().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 33dbdbf0 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_dump_cpu_stacks()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_dump_cpu_stacks().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 8fd119b6 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_check_gp_kthread_starvation()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_check_gp_kthread_starvation().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# ad3832e9 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from record_gp_stall_check_time()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
record_gp_stall_check_time().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 336a4f6c 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_get_root()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_get_root().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# de8e8730 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_gp_in_progress()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_gp_in_progress().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 33085c46 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_report_qs_rdp()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_report_qs_rdp().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 139ad4da 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_report_unblock_qs_rnp()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_report_unblock_qs_rnp(), which is particularly appropriate in
this case given that this parameter is no longer used.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# aff4e9ed 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_report_qs_rsp()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_report_qs_rsp().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b50912d0 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rsp parameter from rcu_report_qs_rnp()

There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_report_qs_rnp().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 2280ee5a 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rcu_data_p pointer to default rcu_data structure

The rcu_data_p pointer references the default set of per-CPU rcu_data
structures, that is, those that call_rcu() uses, as opposed to
call_rcu_bh() and sometimes call_rcu_sched(). But there is now only one
set of per-CPU rcu_data structures, so that one set is by definition
the default, which means that the rcu_data_p pointer no longer serves
any useful purpose. This commit therefore removes it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 16fc9c60 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rcu_state_p pointer to default rcu_state structure

The rcu_state_p pointer references the default rcu_state structure,
that is, the one that call_rcu() uses, as opposed to call_rcu_bh()
and sometimes call_rcu_sched(). But there is now only one rcu_state
structure, so that one structure is by definition the default, which
means that the rcu_state_p pointer no longer serves any useful purpose.
This commit therefore removes it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# da1df50d 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rcu_state structure's ->rda field

The rcu_state structure's ->rda field was used to find the per-CPU
rcu_data structures corresponding to that rcu_state structure. But now
there is only one rcu_state structure (creatively named "rcu_state")
and one set of per-CPU rcu_data structures (creatively named "rcu_data").
Therefore, uses of the ->rda field can always be replaced by "rcu_data,
and this commit makes that change and removes the ->rda field.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# ec5dd444 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate rcu_state structure's ->call field

The rcu_state structure's ->call field references the corresponding RCU
flavor's call_rcu() function. However, now that there is only ever one
rcu_state structure in a given build of the Linux kernel, and that flavor
uses plain old call_rcu(), there is not a lot of point in continuing to
have the ->call field. This commit therefore removes it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 358be2d3 03-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove RCU_STATE_INITIALIZER()

Now that a given build of the Linux kernel has only one set of rcu_state,
rcu_node, and rcu_data structures, there is no point in creating a macro
to declare and compile-time initialize them. This commit therefore
just does normal declaration and compile-time initialization of these
structures. While in the area, this commit also removes #ifndefs of
the no-longer-ever-defined preprocessor macro RCU_TREE_NONCORE.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 45975c7d 02-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds

Now that RCU-preempt knows about preemption disabling, its implementation
of synchronize_rcu() works for synchronize_sched(), and likewise for the
other RCU-sched update-side API members. This commit therefore confines
the RCU-sched update-side code to CONFIG_PREEMPT=n builds, and defines
RCU-sched's update-side API members in terms of those of RCU-preempt.

This means that any given build of the Linux kernel has only one
update-side flavor of RCU, namely RCU-preempt for CONFIG_PREEMPT=y builds
and RCU-sched for CONFIG_PREEMPT=n builds. This in turn means that kernels
built with CONFIG_RCU_NOCB_CPU=y have only one rcuo kthread per CPU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>


# 4cf439a2 02-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix typo in rcu_get_gp_kthreads_prio() header comment

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 2bbfc25b 02-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Drop "wake" parameter from rcu_report_exp_rdp()

The rcu_report_exp_rdp() function is always invoked with its "wake"
argument set to "true", so this commit drops this parameter. The only
potential call site that would use "false" is in the code driving the
expedited grace period, and that code uses rcu_report_exp_cpu_mult()
instead, which therefore retains its "wake" parameter.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 82fcecfa 02-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Update comments and help text for no more RCU-bh updaters

This commit updates comments and help text to account for the fact that
RCU-bh update-side functions are now simple wrappers for their RCU or
RCU-sched counterparts.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 65cfe358 01-Jul-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Define RCU-bh update API in terms of RCU

Now that the main RCU API knows about softirq disabling and softirq's
quiescent states, the RCU-bh update code can be dispensed with.
This commit therefore removes the RCU-bh update-side implementation and
defines RCU-bh's update-side API in terms of that of either RCU-preempt or
RCU-sched, depending on the setting of the CONFIG_PREEMPT Kconfig option.

In kernels built with CONFIG_RCU_NOCB_CPU=y this has the knock-on effect
of reducing by one the number of rcuo kthreads per CPU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d28139c4 28-Jun-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Apply RCU-bh QSes to RCU-sched and RCU-preempt when safe

One necessary step towards consolidating the three flavors of RCU is to
make sure that the resulting consolidated "one flavor to rule them all"
correctly handles networking denial-of-service attacks. One thing that
allows RCU-bh to do so is that __do_softirq() invokes rcu_bh_qs() every
so often, and so something similar has to happen for consolidated RCU.

This must be done carefully. For example, if a preemption-disabled
region of code takes an interrupt which does softirq processing before
returning, consolidated RCU must ignore the resulting rcu_bh_qs()
invocations -- preemption is still disabled, and that means an RCU
reader for the consolidated flavor.

This commit therefore creates a new rcu_softirq_qs() that is called only
from the ksoftirqd task, thus avoiding the interrupted-a-preempted-region
problem. This new rcu_softirq_qs() function invokes rcu_sched_qs(),
rcu_preempt_qs(), and rcu_preempt_deferred_qs(). The latter call handles
any deferred quiescent states.

Note that __do_softirq() still invokes rcu_bh_qs(). It will continue to
do so until a later stage of cleanup when the RCU-bh flavor is removed.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix !SMP issue located by kbuild test robot. ]


# e11ec65c 28-Jun-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Add warning to detect half-interrupts

RCU's dyntick-idle code is written to tolerate half-interrupts, that it,
either an interrupt that invokes rcu_irq_enter() but never invokes the
corresponding rcu_irq_exit() on the one hand, or an interrupt that never
invokes rcu_irq_enter() but does invoke the "corresponding" rcu_irq_exit()
on the other. These things really did happen at one time, as evidenced
by this ca-2011 LKML post:

http://lkml.kernel.org/r/20111014170019.GE2428@linux.vnet.ibm.com

The reason why RCU tolerates half-interrupts is that usermode helpers
used exceptions to invoke a system call from within the kernel such that
the system call did a normal return (not a return from exception) to
the calling context. This caused rcu_irq_enter() to be invoked without
a matching rcu_irq_exit(). However, usermode helpers have since been
rewritten to make much more housebroken use of workqueues, kernel threads,
and do_execve(), and therefore should no longer produce half-interrupts.
No one knows of any other source of half-interrupts, but then again,
no one seems insane enough to go audit the entire kernel to verify that
half-interrupts really are a relic of the past.

This commit therefore adds a pair of WARN_ON_ONCE() calls that will
trigger in the presence of half interrupts, which the code will continue
to handle correctly. If neither of these WARN_ON_ONCE() trigger by
mid-2021, then perhaps RCU can stop handling half-interrupts, which
would be a considerable simplification.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>


# 3e310098 21-Jun-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Defer reporting RCU-preempt quiescent states when disabled

This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.

This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.

Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.

In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.

Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.

Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]


# cf7614e1 22-Jun-2018 Byungchul Park <byungchul.park@lge.com>

rcu: Refactor rcu_{nmi,irq}_{enter,exit}()

When entering or exiting irq or NMI handlers, the current code uses
->dynticks_nmi_nesting to detect if it is in the outermost handler,
that is, the one interrupting or returning to an RCU-idle context (the
idle loop or nohz_full usermode execution). When entering the outermost
handler via an interrupt (as opposed to NMI), it is necessary to invoke
rcu_dynticks_task_exit() just before the CPU is marked non-idle from an
RCU perspective and to invoke rcu_cleanup_after_idle() just after the
CPU is marked non-idle. Similarly, when exiting the outermost handler
via an interrupt, it is necessary to invoke rcu_prepare_for_idle() just
before marking the CPU idle and to invoke rcu_dynticks_task_enter()
just after marking the CPU idle.

The decision to execute these four functions is currently taken in
rcu_irq_enter() and rcu_irq_exit() as follows:

rcu_irq_enter()
/* A conditional branch with ->dynticks_nmi_nesting */
rcu_nmi_enter()
/* A conditional branch with ->dynticks */
/* A conditional branch with ->dynticks_nmi_nesting */

rcu_irq_exit()
/* A conditional branch with ->dynticks_nmi_nesting */
rcu_nmi_exit()
/* A conditional branch with ->dynticks_nmi_nesting */
/* A conditional branch with ->dynticks_nmi_nesting */

rcu_nmi_enter()
/* A conditional branch with ->dynticks */

rcu_nmi_exit()
/* A conditional branch with ->dynticks_nmi_nesting */

This works, but the conditional branches in rcu_irq_enter() and
rcu_irq_exit() are redundant with those in rcu_nmi_enter() and
rcu_nmi_exit(), respectively. Redundant branches are not something
we want in the to/from-idle fastpaths, so this commit refactors
rcu_{nmi,irq}_{enter,exit}() so they use a common inlined function passed
a constant argument as follows:

rcu_irq_enter() inlining rcu_nmi_enter_common(irq=true)
/* A conditional branch with ->dynticks */

rcu_irq_exit() inlining rcu_nmi_exit_common(irq=true)
/* A conditional branch with ->dynticks_nmi_nesting */

rcu_nmi_enter() inlining rcu_nmi_enter_common(irq=false)
/* A conditional branch with ->dynticks */

rcu_nmi_exit() inlining rcu_nmi_exit_common(irq=false)
/* A conditional branch with ->dynticks_nmi_nesting */

The combination of the constant function argument and the inlining allows
the compiler to discard the conditionals that previously controlled
execution of rcu_dynticks_task_exit(), rcu_cleanup_after_idle(),
rcu_prepare_for_idle(), and rcu_dynticks_task_enter(). This reduces both
the to-idle and from-idle path lengths by two conditional branches each,
and improves readability as well.

This commit also changes order of execution from this:

rcu_dynticks_task_exit();
rcu_dynticks_eqs_exit();
trace_rcu_dyntick();
rcu_cleanup_after_idle();

To this:

rcu_dynticks_task_exit();
rcu_dynticks_eqs_exit();
rcu_cleanup_after_idle();
trace_rcu_dyntick();

In other words, the calls to rcu_cleanup_after_idle() and
trace_rcu_dyntick() are reversed. This has no functional effect because
the real concern is whether a given call is before or after the call to
rcu_dynticks_eqs_exit(), and this patch does not change that. Before the
call to rcu_dynticks_eqs_exit(), RCU is not yet watching the current
CPU and after that call RCU is watching.

A similar switch in calling order happens on the idle-entry path, with
similar lack of effect for the same reasons.

Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Applied Steven Rostedt feedback. ]
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 4babd855 19-Jun-2018 Joel Fernandes (Google) <joel@joelfernandes.org>

rcutorture: Add support to detect if boost kthread prio is too low

When rcutorture is built in to the kernel, an earlier patch detects
that and raises the priority of RCU's kthreads to allow rcutorture's
RCU priority boosting tests to succeed.

However, if rcutorture is built as a module, those priorities must be
raised manually via the rcutree.kthread_prio kernel boot parameter.
If this manual step is not taken, rcutorture's RCU priority boosting
tests will fail due to kthread starvation. One approach would be to
raise the default priority, but that risks breaking existing users.
Another approach would be to allow runtime adjustment of RCU's kthread
priorities, but that introduces numerous "interesting" race conditions.
This patch therefore instead detects too-low priorities, and prints a
message and disables the RCU priority boosting tests in that case.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 6bea2cc5 16-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove rcutorture test version and sequence number

Back when RCU had a debugfs interface, there was a test version and
sequence number that allowed associating debugfs data with a particular
test run, where the test run started with modprobe and ended with rmmod,
which was how tests were run back on the old ABAT system within IBM.
But rcutorture testing no longer runs on ABAT, and there is no longer an
RCU debugfs interface, so there is no longer any need for test versions
and sequence numbers.

This commit therefore removes the rcutorture_record_test_transition()
and rcutorture_record_progress() functions, and along with them the
rcutorture_testseq and rcutorture_vernum variables that they update.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# c7cd161e 19-Jun-2018 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Assign higher prio to RCU threads if rcutorture is built-in

The rcutorture RCU priority boosting tests fail even with CONFIG_RCU_BOOST
set because rcutorture's threads run at the same priority as the default
RCU kthreads (RT class with priority of 1).

This patch checks if RCU torture is built into the kernel and if so,
assigns RT priority 1 to the RCU threads, allowing the rcutorture boost
tests to pass.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 67abb96c 31-May-2018 Byungchul Park <byungchul.park@lge.com>

rcu: Check the range of jiffies_till_{first,next}_fqs when setting them

Currently, the range of jiffies_till_{first,next}_fqs are checked and
adjusted on and on in the loop of rcu_gp_kthread on runtime.

However, it's enough to check them only when setting them, not every
time in the loop. So make them handled on a setting time via sysfs.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 47199a08 28-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Add diagnostics for rcutorture writer stall warning

This commit adds any in-the-future ->gp_seq_needed fields to the
diagnostics for an rcutorture writer stall warning message.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b06ae25a 17-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Use RCU CPU stall timeout for rcu_check_gp_start_stall()

Currently, rcu_check_gp_start_stall() waits for one second after the first
request before complaining that a grace period has not yet started. This
was desirable while testing the conversion from ->future_gp_needed[] to
->gp_seq_needed, but it is a bit on the hair-trigger side for production
use under heavy load. This commit therefore makes this wait time be
exactly that of the RCU CPU stall warning, allowing easy adjustment of
both timeouts to suit the distribution or installation at hand.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 51fbb910 17-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove __maybe_unused from rcu_cpu_has_callbacks()

The rcu_cpu_has_callbacks() function is now used in all configurations,
so this commit removes the __maybe_unused.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 95394e69 17-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove "inline" from panic_on_rcu_stall() and rcu_blocking_is_gp()

These functions are in kernel/rcu/tree.c, which is not an include file,
so there is no problem dropping the "inline", especially given that these
functions are nowhere near a fastpath. This commit therefore delegates
the inlining decision to the compiler by dropping the "inline".

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 3b57a399 16-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Inline rcu_dynticks_momentary_idle() into its sole caller

The rcu_dynticks_momentary_idle() function is invoked only from
rcu_momentary_dyntick_idle(), and neither function is particularly
large. This commit therefore saves a few lines by inlining
rcu_dynticks_momentary_idle() into rcu_momentary_dyntick_idle().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 6f56f714 14-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Improve RCU-tasks naming and comments

The naming and comments associated with some RCU-tasks code make
the faulty assumption that context switches due to cond_resched()
are voluntary. As several people pointed out, this is not the case.
This commit therefore updates function names and comments to better
reflect current reality.

Reported-by: Byungchul Park <byungchul.park@lge.com>
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a7538352 14-May-2018 Joe Perches <joe@perches.com>

rcu: Use pr_fmt to prefix "rcu: " to logging output

This commit also adjusts some whitespace while in the area.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Revert string-breaking %s as requested by Andy Shevchenko. ]


# 07f27570 11-May-2018 Byungchul Park <byungchul.park@lge.com>

rcu: Improve rcu_note_voluntary_context_switch() reporting

We expect a quiescent state of TASKS_RCU when cond_resched_tasks_rcu_qs()
is called, no matter whether it actually be scheduled or not. However,
it currently doesn't report the quiescent state when the task enters
into __schedule() as it's called with preempt = true. So make it report
the quiescent state unconditionally when cond_resched_tasks_rcu_qs() is
called.

And in TINY_RCU, even though the quiescent state of rcu_bh also should
be reported when the tick interrupt comes from user, it doesn't. So make
it reported.

Lastly in TREE_RCU, rcu_note_voluntary_context_switch() should be
reported when the tick interrupt comes from not only user but also idle,
as an extended quiescent state.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Simplify rcutiny portion given no RCU-tasks for !PREEMPT. ]


# f2e2df59 15-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Add diagnostics for offline CPUs failing to report QS

CPUs are expected to report quiescent states when coming online and
when going offline, and grace-period initialization is supposed to
handle any race conditions where a CPU's ->qsmask bit is set just after
it goes offline. This commit adds diagnostics for the case where an
offline CPU nevertheless has a grace period waiting on it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# fea3f222 15-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Record ->gp_state for both phases of grace-period initialization

Grace-period initialization first processes any recent CPU-hotplug
operations, and then initializes state for the new grace period. These
two phases of initialization are currently not distinguished in debug
prints, but the distinction is valuable in a number of debug situations.
This commit therefore introduces two new values for ->gp_state,
RCU_GP_ONOFF and RCU_GP_INIT, in order to make this distinction.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 57738942 08-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Add CPU online/offline state to dump_blkd_tasks()

Interactions between CPU-hotplug operations and grace-period
initialization can result in dump_blkd_tasks(). One of the first
debugging actions in this case is to search back in dmesg to work
out which of the affected rcu_node structure's CPUs are online and to
determine the last CPU-hotplug operation affecting any of those CPUs.
This can be laborious and error-prone, especially when console output
is lost.

This commit therefore causes dump_blkd_tasks() to dump the state of
the affected rcu_node structure's CPUs and the last grace period during
which the last offline and online operation affected each of these CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# e05121ba 07-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove CPU-hotplug failsafe from force-quiescent-state code path

Now that quiescent states for newly offlined CPUs are reported either
when that CPU goes offline or at the end of grace-period initialization,
the CPU-hotplug failsafe in the force-quiescent-state code path is no
longer needed.

This commit therefore removes this failsafe.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 17a8212b 03-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove failsafe check for lost quiescent state

Now that quiescent-state reporting is fully event-driven, this commit
removes the check for a lost quiescent state from force_qs_rnp().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# f34f2f58 03-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Move grace-period pre-init delay after pre-init

The main race with the early part of grace-period initialization appears
to be with CPU hotplug. To more fully open this race window, this commit
moves the rcu_gp_slow() from the beginning of the early initialization
loop to follow that loop, thus widening the race window, especially for
the rcu_node structures that are initialized last. This commit also
expands rcutree.gp_preinit_delay from 3 to 12, giving the same overall
delay in the grace period, but concentrated in the spot where it will
do the most good.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 1e64b15a 25-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix grace-period hangs due to race with CPU offline

Without special fail-safe quiescent-state-propagation checks, grace-period
hangs can result from the following scenario:

1. CPU 1 goes offline.

2. Because CPU 1 is the only CPU in the system blocking the current
grace period, the grace period ends as soon as
rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp()
returns.

3. At this point, the leaf rcu_node structure's ->lock is no longer
held: rcu_report_qs_rnp() has released it, as it must in order
to awaken the RCU grace-period kthread.

4. At this point, that same leaf rcu_node structure's ->qsmaskinitnext
field still records CPU 1 as being online. This is absolutely
necessary because the scheduler uses RCU (in this case on the
wake-up path while awakening RCU's grace-period kthread), and
->qsmaskinitnext contains RCU's idea as to which CPUs are online.
Therefore, invoking rcu_report_qs_rnp() after clearing CPU 1's
bit from ->qsmaskinitnext would result in a lockdep-RCU splat
due to RCU being used from an offline CPU.

5. RCU's grace-period kthread awakens, sees that the old grace period
has completed and that a new one is needed. It therefore starts
a new grace period, but because CPU 1's leaf rcu_node structure's
->qsmaskinitnext field still shows CPU 1 as being online, this new
grace period is initialized to wait for a quiescent state from the
now-offline CPU 1.

6. Without the fail-safe force-quiescent-state checks, there would
be no quiescent state from the now-offline CPU 1, which would
eventually result in RCU CPU stall warnings and memory exhaustion.

It would be good to get rid of the special fail-safe quiescent-state
propagation checks, and thus it would be good to fix things so that
the above scenario cannot happen. This commit therefore adds a new
->ofl_lock to the rcu_state structure. This lock is held by rcu_gp_init()
across the applying of buffered online and offline operations to the
rcu_node tree, and it is also held by rcu_cleanup_dying_idle_cpu()
when buffering a new offline operation. This prevents rcu_gp_init()
from acquiring the leaf rcu_node structure's lock during the interval
between when rcu_cleanup_dying_idle_cpu() invokes rcu_report_qs_rnp(),
which releases ->lock and the re-acquisition of that same lock.
This in turn prevents the failure scenario outlined above, and will
hopefully eventually allow removal of the offline-CPU checks from the
force-quiescent-state code path.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# ec2c2976 07-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix grace-period hangs from mid-init task resume

Without special fail-safe quiescent-state-propagation checks, grace-period
hangs can result from the following scenario:

1. A task running on a given CPU is preempted in its RCU read-side
critical section.

2. That CPU goes offline, and there are now no online CPUs
corresponding to that CPU's leaf rcu_node structure.

3. The rcu_gp_init() function does the first phase of grace-period
initialization, and sets the aforementioned leaf rcu_node
structure's ->qsmaskinit field to all zeroes. Because there
is a blocked task, it does not propagate the zeroing of either
->qsmaskinit or ->qsmaskinitnext up the rcu_node tree.

4. The task resumes on some other CPU and exits its critical section.
There is no grace period in progress, so the resulting quiescent
state is not reported up the tree.

5. The rcu_gp_init() function does the second phase of grace-period
initialization, which results in the leaf rcu_node structure
being initialized to expect no further quiescent states, but
with that structure's parent expecting a quiescent-state report.

The parent will never receive a quiescent state from this leaf
rcu_node structure, so the grace period will hang, resulting in
RCU CPU stall warnings.

It would be good to get rid of the special fail-safe quiescent-state
propagation checks. This commit therefore checks the leaf rcu_node
structure's ->wait_blkd_tasks field during grace-period initialization.
If this flag is set, the rcu_report_qs_rnp() is invoked to immediately
report the possible quiescent state. While in the neighborhood, this
commit also report quiescent states for any CPUs that went offline between
the two phases of grace-period initialization, thus reducing grace-period
delays and hopefully eventually allowing removal of offline-CPU checks
from the force-quiescent-state code path.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 99990da1 03-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Suppress more involved false-positive preempted-task splats

Consider the following sequence of events in a PREEMPT=y kernel:

1. All but one of the CPUs corresponding to a given leaf rcu_node
structure go offline. Each of these CPUs clears its bit in that
structure's ->qsmaskinitnext field.

2. A new grace period starts, and rcu_gp_init() scans the leaf
rcu_node structures, applying CPU-hotplug changes since the
start of the previous grace period, including those changes in
#1 above. This copies each leaf structure's ->qsmaskinitnext
to its ->qsmask field, which represents the CPUs that this new
grace period will wait on. Each copy operation is done holding
the corresponding leaf rcu_node structure's ->lock, and at the
end of this scan, rcu_gp_init() holds no locks.

3. The last CPU corresponding to #1's leaf rcu_node structure goes
offline, clearing its bit in that structure's ->qsmaskinitnext
field, but not touching the ->qsmaskinit field. Note that
rcu_gp_init() is not currently holding any locks! This CPU does
-not- report a quiescent state because the grace period has not
yet initialized itself sufficiently to have set any bits in any
of the leaf rcu_node structures' ->qsmask fields.

4. The rcu_gp_init() function continues initializing the new grace
period, copying each leaf rcu_node structure's ->qsmaskinit
field to its ->qsmask field while holding the corresponding ->lock.
This sets the ->qsmask bit corresponding to #3's CPU.

5. Before the grace period ends, #3's CPU comes back online.
Because te grace period has not yet done any force-quiescent-state
scans (which would report a quiescent state on behalf of any
offline CPUs), this CPU's ->qsmask bit is still set.

6. A task running on the newly onlined CPU is preempted while in
an RCU read-side critical section. Because this CPU's ->qsmask
bit is net, not only does this task queue itself on the leaf
rcu_node structure's ->blkd_tasks list, it also sets that
structure's ->gp_tasks pointer to reference it.

7. The grace period started in #1 above comes to an end. This
results in rcu_gp_cleanup() being invoked, which, among other
things, checks to make sure that there are no tasks blocking the
just-ended grace period, that is, that all ->gp_tasks pointers
are NULL. The ->gp_tasks pointer corresponding to the task
preempted in #3 above is non-NULL, which results in a splat.

This splat is a false positive. The task's RCU read-side critical
section cannot have begun before the just-ended grace period because
this would mean either: (1) The CPU came online before the grace period
started, which cannot have happened because the grace period started
before that CPU went offline, or (2) The task started its RCU read-side
critical section on some other CPU, but then it would have had to have
been preempted before migrating to this CPU, which would mean that it
would have instead queued itself on that other CPU's rcu_node structure.
RCU's grace periods thus are working correctly. Or, more accurately,
that remaining bugs in RCU's grace periods are elsewhere.

This commit eliminates this false positive by adding code to the end
of rcu_cpu_starting() that reports a quiescent state to RCU, which has
the side-effect of clearing that CPU's ->qsmask bit, preventing the
above scenario. This approach has the added benefit of more promptly
reporting quiescent states corresponding to offline CPUs. Nevertheless,
this commit does -not- remove the need for the force-quiescent-state
scans to check for offline CPUs, given that a CPU might remain offline
indefinitely. And without the checks in the force-quiescent-state scans,
the grace period would also persist indefinitely, which could result in
hangs or memory exhaustion.

Note well that the call to rcu_report_qs_rnp() reporting the quiescent
state must come -after- the setting of this CPU's bit in the leaf rcu_node
structure's ->qsmaskinitnext field. Otherwise, lockdep-RCU will complain
bitterly about quiescent states coming from an offline CPU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# fece2776 02-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Suppress false-positive preempted-task splats

Consider the following sequence of events in a PREEMPT=y kernel:

1. All CPUs corresponding to a given rcu_node structure go offline.
A new grace period starts just after the CPU-hotplug code path
does its synchronize_rcu() for the last CPU, so at least this
CPU is present in that structure's ->qsmask.

2. Before the grace period ends, a CPU comes back online, and not
just any CPU, but the one corresponding to a non-zero bit in
the leaf rcu_node structure's ->qsmask.

3. A task running on the newly onlined CPU is preempted while in
an RCU read-side critical section. Because this CPU's ->qsmask
bit is net, not only does this task queue itself on the leaf
rcu_node structure's ->blkd_tasks list, it also sets that
structure's ->gp_tasks pointer to reference it.

4. The grace period started in #1 above comes to an end. This
results in rcu_gp_cleanup() being invoked, which, among other
things, checks to make sure that there are no tasks blocking the
just-ended grace period, that is, that all ->gp_tasks pointers
are NULL. The ->gp_tasks pointer corresponding to the task
preempted in #3 above is non-NULL, which results in a splat.

This splat is a false positive. The task's RCU read-side critical
section cannot have begun before the just-ended grace period because
this would mean either: (1) The CPU came online before the grace period
started, which cannot have happened because the grace period started
before that CPU was all the way offline, or (2) The task started its
RCU read-side critical section on some other CPU, but then it would
have had to have been preempted before migrating to this CPU, which
would mean that it would have instead queued itself on that other CPU's
rcu_node structure.

This commit eliminates this false positive by adding code to the end
of rcu_cleanup_dying_idle_cpu() that reports a quiescent state to RCU,
which has the side-effect of clearing that CPU's ->qsmask bit, preventing
the above scenario. This approach has the added benefit of more promptly
reporting quiescent states corresponding to offline CPUs.

Note well that the call to rcu_report_qs_rnp() reporting the quiescent
state must come -before- the clearing of this CPU's bit in the leaf
rcu_node structure's ->qsmaskinitnext field. Otherwise, lockdep-RCU
will complain bitterly about quiescent states coming from an offline CPU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 5554788e 15-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Suppress false-positive offline-CPU lockdep-RCU splat

The rcu_lockdep_current_cpu_online() function currently checks only the
RCU-sched data structures to determine whether or not RCU believes that a
given CPU is offline. Unfortunately, there are multiple flavors of RCU,
which means that there is a short window of time during which the various
flavors disagree as to whether or not a given CPU is offline. This can
result in false-positive lockdep-RCU splats in which some other flavor
of RCU tries to do something based on its view that the CPU is online,
only to get hit with a lockdep-RCU splat because RCU-sched instead
believes that the CPU is offline.

This commit therefore changes rcu_lockdep_current_cpu_online() to scan
all RCU flavors and to consider a given CPU to be online if any of the
RCU flavors believe it to be online, thus preventing these false-positive
splats.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 92816435 02-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Prevent useless FQS scan after all CPUs have checked in

The force_qs_rnp() function checks for ->qsmask being all zero, that is,
all CPUs for the current rcu_node structure having already passed through
quiescent states. But with RCU-preempt, this is not sufficient to report
quiescent states further up the tree, so there are further checks that
can initiate RCU priority boosting and also for races with CPU-hotplug
operations. However, if neither of these further checks apply, the code
proceeds to carry out a useless scan of an all-zero ->qsmask.

This commit therefore adds code to release the current rcu_node
structure's lock and continue on to the next rcu_node structure, thereby
avoiding this useless scan.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 91f63ced 01-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Replace smp_wmb() with smp_store_release() for stall check

This commit gets rid of the smp_wmb() in record_gp_stall_check_time()
in favor of an smp_store_release().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 77cfc7bf 01-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix typo and add additional debug

This commit fixes a typo and adds some additional debugging to the
message emitted when a task blocking the current grace period is listed
as blocking it when either that grace period ends or the next grace
period begins. This commit also reformats the console message for
readability.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# c74859d1 27-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_report_unblock_qs_rnp() warn on violated preconditions

If rcu_report_unblock_qs_rnp() is invoked on something other than
preemptible RCU or if there are still preempted tasks blocking the
current grace period, something went badly wrong in the caller.
This commit therefore adds WARN_ON_ONCE() to these conditions, but
leaving the legitimate reason for early exit (rnp->qsmask != 0)
unwarned.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 8d672fa6 02-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_init_new_rnp() stop upon already-set bit

Currently, rcu_init_new_rnp() walks up the rcu_node combining tree,
setting bits in the ->qsmaskinit fields on the way up. It walks up
unconditionally, regardless of the initial state of these bits. This is
OK because only the corresponding RCU grace-period kthread ever tests
or sets these bits during runtime. However, it is also pointless, and
it increases both memory and lock contention (albeit only slightly), so
this commit stops the walk as soon as an already-set bit is encountered.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# c50cbe53 02-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix an obsolete ->qsmaskinit comment

Back in the old days, when grace-period initialization blocked CPU
hotplug, the ->qsmaskinit mask was indeed updated at the time that
a given CPU went offline. However, with the deferral of these updates
until the beginning of the next grace period in commit 0aa04b055e71
("rcu: Process offlining and onlining only at grace-period start"),
it is instead ->qsmaskinitnext that gets updated at that time.

This commit therefore updates the obsolete comment. It also fixes
punctuation while on the topic of comments mentioning ->qsmaskinit.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 962aff03 02-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Clean up handling of tasks blocked across full-rcu_node offline

Commit 0aa04b055e71 ("rcu: Process offlining and onlining only at
grace-period start") deferred handling of CPU-hotplug events until the
start of the next grace period, but consider the following sequence
of events:

1. A task is preempted within an RCU-preempt read-side critical
section.

2. The CPU that this task was running on goes offline, along with all
other CPUs sharing the corresponding leaf rcu_node structure.

3. The task resumes execution.

4. One of those CPUs comes back online before a new grace period starts.

In step 2, the code in the next rcu_gp_init() invocation will (correctly)
defer removing the leaf rcu_node structure from the upper-level bitmasks,
and will (correctly) set that structure's ->wait_blkd_tasks field. During
the ensuing interval, RCU will (correctly) track the tasks preempted on
that structure because they must block any subsequent grace period.

In step 3, the code in rcu_read_unlock_special() will (correctly) remove
the task from the leaf rcu_node structure. From this point forward, RCU
need not pay attention to this structure, at least not until one of the
corresponding CPUs comes back online.

In step 4, the code in the next rcu_gp_init() invocation will
(incorrectly) invoke rcu_init_new_rnp(). This is incorrect because
the corresponding rcu_cleanup_dead_rnp() was never invoked. This is
nevertheless harmless because the upper-level bits are still set.
So, no harm, no foul, right?

At least, all is well until a little further into rcu_gp_init()
invocation, which will notice that there are no longer any tasks blocked
on the leaf rcu_node structure, conclude that there is no longer anything
left over from step 2's offline operation, and will therefore invoke
rcu_cleanup_dead_rnp(). But this invocation of rcu_cleanup_dead_rnp()
is for the beginning of the earlier offline interval, and the previous
invocation of rcu_init_new_rnp() is for the end of that same interval.
That is right, they are invoked out of order.

That cannot be good, can it?

It turns out that this is not a (correctness!) problem because
rcu_cleanup_dead_rnp() checks to see if any of the corresponding CPUs
are online, and refuses to do anything if so. In other words, in the
case where rcu_init_new_rnp() and rcu_cleanup_dead_rnp() execute out of
order, they both have no effect.

But this is at best an accident waiting to happen.

This commit therefore adds logic to rcu_gp_init() so that
rcu_init_new_rnp() and rcu_cleanup_dead_rnp() are always invoked in
order, and so that neither are invoked at all in cases where RCU had to
pay attention to the leaf rcu_node structure during the entire time that
all corresponding CPUs were offline.

And, while in the area, this commit reduces confusion by using formal
parameters rather than local variables that just happen to have the same
value at that particular point in the code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 226ca5e7 23-May-2018 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Identify grace period is in progress as we advance up the tree

There's no need to keep checking the same starting node for whether a
grace period is in progress as we advance up the funnel lock loop. Its
sufficient if we just checked it in the start, and then subsequently
checked the internal nodes as we advanced up the combining tree. This
also makes sense because the grace-period updates propogate from the
root to the leaf, so there's a chance we may find a grace period has
started as we advance up, lets check for the same.

Reported-by: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# df2bf8f7 23-May-2018 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Use better variable names in funnel locking loop

The funnel locking loop in rcu_start_this_gp uses rcu_root as a
temporary variable while walking the combining tree. This causes a
tiresome exercise of a code reader reminding themselves that rcu_root
may not be root. Lets just call it rnp, and rename other variables as
well to be more appropriate.

Original patch: https://patchwork.kernel.org/patch/10396577/

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix name in comment as well. ]


# b73de91d 20-May-2018 Joel Fernandes <joelaf@google.com>

rcu: Rename the grace-period-request variables and parameters

The name 'c' is used for variables and parameters holding the requested
grace-period sequence number. However it is no longer very meaningful
given the conversions from ->gpnum and (especially) ->completed to
->gp_seq. This commit therefore renames 'c' to 'gp_seq_req'.

Previous patch discussion is at:
https://patchwork.kernel.org/patch/10396579/

Signed-off-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 3d18469a 15-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Regularize resetting of rcu_data wrap indicator

The rcu_data structure's ->gpwrap indicator is currently reset only
when the CPU in question detects a new grace period. This is in theory
sufficient because any CPU that has been out of action for long enough
that its ->gpwrap indicator is set is guaranteed to see both the end
of an old grace period and the start of a new one.

However, the current code leaves a short window during which the ->gpwrap
indicator has been reset but the corresponding ->gp_seq counter has not
yet been brought up to date. This is harmless because interrupts are
disabled, but it is likely to (at the very least) cause confusion.

This commit therefore moves the resetting of ->gpwrap to follow the
updating of ->gp_seq. While in the area, it also resets ->gp_seq_needed.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d7219312 15-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Correctly handle grace-period sequence wrap

The new ->gq_seq grace-period sequence numbers must be shifted down,
which give artifacts when these numbers wrap. This commit therefore
enables rcutorture and rcuperf to handle grace-period sequence numbers
even if they do wrap. It does this by allowing a special subtraction
function to be specified, and this function subtracts before shifting.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 2e3e5e55 15-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_start_this_gp() check for grace period already started

In the old days of ->gpnum and ->completed, the code requesting a new
grace period checked to see if that grace period had already started,
bailing early if so. The new-age ->gp_seq approach instead checks
whether the grace period has already finished. A compensating change
pushed the requested grace period down to the bottom of the tree, thus
reducing lock contention and even eliminating it in some cases. But why
not further reduce contention, especially on large systems, by doing both,
especially given that the cost of doing both is extremely small?

This commit therefore adds a new rcu_seq_started() function that checks
whether a specified grace period has already started. It then uses
this new function in place of rcu_seq_done() in the rcu_start_this_gp()
function's funnel locking code.

Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 5ca0905f 13-May-2018 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Fix cpustart tracepoint gp_seq number

The "cpustart" trace event shows a stale gp_seq. This is because it uses
rdp->gp_seq, which is updated only at the end of the __note_gp_changes()
function. This commit therefore instead uses rnp->gp_seq.

An alternative fix would be to update rdp->gp_seq earlier, but this would
break RCU's detection of the beginning of a new-to-this-CPU grace period.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 5b55072f 13-May-2018 Joel Fernandes (Google) <joel@joelfernandes.org>

rcu: Produce last "CleanupMore" trace only if late-breaking request

Currently Tree RCU's clean-up code emits a "CleanupMore" trace event in
response to late-arriving grace-period requests even if the grace period
was already requested. This makes "CleanupMore" show up an extra time (in
addition to once for each rcu_node structure that was previously marked
with the request), and for no good reason. This commit therefore avoids
emitting this trace message unless the the only request for this next
grace period arrived during or after the cleanup scan of the rcu_node
structures.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a2165e41 12-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Don't funnel-lock above leaf node if GP in progress

The old grace-period start code would acquire only the leaf's rcu_node
structure's ->lock if that structure believed that a grace period was
in progress. The new code advances to the leaf's parent in this case,
needlessly acquiring then leaf's parent's ->lock. This commit therefore
checks the grace-period state after marking the leaf with the need for
the specified grace period, and if the leaf believes that a grace period
is in progress, takes an early exit.

Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Add "Startedleaf" tracing as suggested by Joel Fernandes. ]


# e44e73ca 01-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make simple callback acceleration refer to rdp->gp_seq_needed

Now that the rcu_data structure contains ->gp_seq_needed, create an
rcu_accelerate_cbs_unlocked() helper function that locklessly checks to
see if new callbacks' required grace period has already been requested.
If so, update the callback list locally and again locklessly. (Though
interrupts must be and are disabled to avoid racing with conflicting
updates in interrupt handlers.)

Otherwise, call rcu_accelerate_cbs() as before.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# ff3bb6f4 01-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove ->gpnum and ->completed

Now that everything has been converted to use ->gp_seq instead of
->gpnum and ->completed, this commit removes ->gpnum and ->completed.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# fee5997c 01-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Convert rcu_fqs tracepoint to ->gp_seq

This commit makes the rcu_fqs tracepoint use ->gp_seq instead of ->gpnum.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# db023296 01-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Convert rcu_quiescent_state_report tracepoint to ->gp_seq

This commit makes the rcu_quiescent_state_report tracepoint use ->gp_seq
instead of ->gpnum.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# abd13fdd 01-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Convert rcu_future_grace_period tracepoint to gp_seq

This commit makes the rcu_future_grace_period tracepoint use gp_seq
instead of ->gpnum and ->completed.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 477351f7 01-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Convert rcu_grace_period tracepoint to gp_seq

This commit makes the rcu_grace_period tracepoint use gp_seq instead
of ->gpnum or ->completed. It also introduces a "cpuofl-bgp" string to
less obscurely indicate when a CPU has gone offline while a grace period
is waiting on it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# ab5e869c 01-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_nocb_wait_gp() check if GP already requested

This commit makes rcu_nocb_wait_gp() check rdp->gp_seq_needed to see
if the current CPU already knows about the needed grace period having
already been requested. If so, it avoids acquiring the corresponding
leaf rcu_node structure's ->lock, thus decreasing contention. This
optimization is intended for cases where either multiple leader rcuo
kthreads are running on the same CPU or these kthreads are running on
a non-offloaded (e.g., housekeeping) CPU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Move lock release past "if" as suggested by Joel Fernandes. ]
[ paulmck: Fix caching of furthest-future requested grace period. ]


# 7a1d0f23 01-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Move from ->need_future_gp[] to ->gp_seq_needed

One problem with the ->need_future_gp[] array is that the grace-period
assignment of each element changes as the grace periods complete.
This means that it is necessary to hold a lock when checking this
array to learn if a given grace period has already been requested.
This increase lock contention, which is the opposite of helpful.
This commit therefore replaces the ->need_future_gp[] with a single
->gp_seq_needed value and keeps it updated in the rcu_data structure.

This will enable reliable lockless checking of whether or not a given
grace period has already been requested.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# aebc8264 01-May-2018 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Convert rcutorture_get_gp_data() to ->gp_seq

SRCU has long used ->srcu_gp_seq, and now RCU uses ->gp_seq. This
commit therefore moves the rcutorture_get_gp_data() function from
a ->gpnum / ->completed pair to ->gp_seq.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 471f87c3 30-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make RCU CPU stall warnings use ->gp_seq

This commit makes the RCU CPU stall-warning code in print_other_cpu_stall(),
print_cpu_stall(), and check_cpu_stall() use ->gp_seq instead of ->gpnum
and ->completed.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 29365e56 30-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Convert grace-period requests to ->gp_seq

This commit converts the grace-period request code paths from ->completed
and ->gpnum to ->gp_seq. The need_future_gp_element() macro encapsulates
the shift operation required to use ->gp_seq as an index to the
->need_future_gp[] array. The rcu_cbs_completed() function is removed
in favor of the rcu_seq_snap() function. The rcu_start_this_gp()
gets some temporary consistency checks and uses rcu_seq_done(),
rcu_seq_current(), rcu_seq_state(), and rcu_gp_in_progress() in place
of the earlier open-coded comparisons of ->gpnum and ->completed.
The rcu_future_gp_cleanup() function replaces use of ->completed
with ->gp_seq. The rcu_accelerate_cbs() function replaces a call to
rcu_cbs_completed() with one to rcu_seq_snap(). The rcu_advance_cbs()
function replaces an access to >completed with one to ->gp_seq and adds
some temporary warnings. The rcu_nocb_wait_gp() function replaces a
call to rcu_cbs_completed() with one to rcu_seq_snap() and an open-coded
comparison with rcu_seq_done().

The temporary warnings will be removed when the various ->gpnum and
->completed fields are removed. Their purpose is to locate code who
might still be using ->gpnum and ->completed. (Much easier that way
than trying to trace down the causes of too-short grace periods and
grace-period hangs!)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d43a5d32 28-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Convert ->completedqs to ->gp_seq

This commit switches the quiescent-state no-backtracking checks from
->gpnum and ->completed to ->gp_seq.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 8aa670cd 28-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Convert ->rcu_iw_gpnum to ->gp_seq

This commit switches the interrupt-disabled detection mechanism to
->gp_seq. This mechanism is used as part of RCU CPU stall warnings,
and detects cases where the stall is due to a CPU having interrupts
disabled.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# ba04107f 27-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Move rcu_gp_in_progress() to ->gp_seq

This commit makes rcu_gp_in_progress() use ->gp_seq instead of
->completed and ->gpnum. The READ_ONCE() invocations are buried
in rcu_seq_current().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# e05720b0 27-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Move rcu_implicit_dynticks_qs() to ->gp_seq

This commit makes rcu_implicit_dynticks_qs() use ->gp_seq, with the
exception of tracing, which will be converted later.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a66ae8ae 27-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Convert rcu_gpnum_ovf() to ->gp_seq

This commit converts rcu_gpnum_ovf() to use ->gp_seq instead of ->gpnum.
Same size unsigned long, so same approach.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 67e14c1e 27-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Move RCU's grace-period-change code to ->gp_seq

This commit moves __note_gp_changes(), note_gp_changes(), and
__rcu_pending() to ->gp_seq, creating new rcu_seq_completed_gp() and
rcu_seq_new_gp() functions for this purpose.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Reinstate "cpuend: trace as suggested by Joel Fernandes. ]


# e4be81a2 27-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Convert conditional grace-period primitives to ->gp_seq

This commit converts get_state_synchronize_rcu(), cond_synchronize_rcu(),
get_state_synchronize_sched(), and cond_synchronize_sched() from ->gpnum
and ->completed to ->gp_seq. Note that this also introduces a full
memory barrier in the already-done paths off cond_synchronize_rcu() and
cond_synchronize_sched(), as work with LKMM indicates that the earlier
smp_load_acquire() were insufficiently strong in some situations where
these two functions were called just as the grace period ended. In such
cases, these two functions would not gain the benefit of memory ordering
at the end of the grace period.

Please note that the performance impact is negligible, as you shouldn't
be using either function anywhere near a fastpath in any case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# c9a24e2d 27-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make quiescent-state reporting use ->gp_seq

This commit switches the functions reporting quiescent states from
use of ->gpnum to ->gp_seq. In either case, the point is to handle
races where a given grace period ends before a quiescent state can
be reported. Failing to catch these races would result in too-short
grace periods, hence the checking.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 78c5a67f 27-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Convert rcu_check_gp_kthread_starvation() to GP sequence number

This commit switches rcu_check_gp_kthread_starvation() from printing
->gpnum and ->completed to printing ->gp_seq upon detecting a starving
RCU grace-period kthread during an RCU CPU stall warning.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 17ef2fe9 27-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcutorture's batches-completed API use ->gp_seq

The rcutorture test invokes rcu_batches_started(),
rcu_batches_completed(), rcu_batches_started_bh(),
rcu_batches_completed_bh(), rcu_batches_started_sched(), and
rcu_batches_completed_sched() to do grace-period consistency checks,
and rcuperf uses the _completed variants for statistics.
These functions use ->gpnum and ->completed. This commit therefore
replaces them with rcu_get_gp_seq(), rcu_bh_get_gp_seq(), and
rcu_sched_get_gp_seq(), adjusting rcutorture and rcuperf to make
use of them.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# dee4f422 26-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Move rcu_gp_slow() to ->gp_seq

This commit moves rcu_gp_slow() to ->gp_seq. This function only uses
the grace-period number to modulate delay, so rcu_seq_ctr(rsp->gp_seq)
gets the same effect, at least in cases where the delay is to happen
more than four times per wrap of an unsigned long.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# de30ad51 26-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Introduce grace-period sequence numbers

This commit adds grace-period sequence numbers (->gp_seq) to the
rcu_state, rcu_node, and rcu_data structures, and updates them.
It also checks for consistency between rsp->gpnum and rsp->gp_seq.
These ->gp_seq counters will eventually replace the existing ->gpnum
and ->completed counters, allowing a single memory access to determine
whether or not a grace period is in progress and if so, which one.
This in turn will enable changes that will reduce ->lock contention on
the leaf rcu_node structures.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 18390aea 22-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_gp_cleanup() write only once to ->gp_flags

At the end of rcu_gp_cleanup(), if another grace period is needed, but
not via rcu_accelerate_cbs(), the ->gp_flags field is written twice,
once when making the new grace-period request, and once when clearing
all other types of requests. This commit therefore adds an else-clause
to avoid this double write.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 26d950a9 21-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Diagnostics for grace-period startup hangs

This commit causes a splat if RCU is idle and a request for a new grace
period is ignored for more than one second. This splat normally indicates
that some code path asked for a new grace period, but failed to wake up
the RCU grace-period kthread.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix bug located by Dan Carpenter and his static checker. ]
[ paulmck: Fix self-deadlock bug located 0day test robot. ]
[ paulmck: Disable unless CONFIG_PROVE_RCU=y. ]


# 8c42b1f3 09-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Exclude near-simultaneous RCU CPU stall warnings

There is a two-jiffy delay between the time that a CPU will self-report
an RCU CPU stall warning and the time that some other CPU will report a
warning on behalf of the first CPU. This has worked well in the past,
but on busy systems, it is possible for the two warnings to overlap,
which makes interpreting them extremely difficult.

This commit therefore uses a cmpxchg-based timing decision that
allows only one report in a given one-minute period (assuming default
stall-warning Kconfig parameters). This approach will of course fail
if you are seeing minute-long vCPU preemption, but in that case the
overlapping RCU CPU stall warnings are the least of your worries.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 4bc8d555 27-Nov-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Add debugging info to assertion

The WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp()) in
rcu_gp_cleanup() triggers (inexplicably, of course) every so often.
This commit therefore extracts more information.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b3dae109 12-Jun-2018 Peter Zijlstra <peterz@infradead.org>

sched/swait: Rename to exclusive

Since swait basically implemented exclusive waits only, make sure
the API reflects that.

$ git grep -l -e "\<swake_up\>"
-e "\<swait_event[^ (]*"
-e "\<prepare_to_swait\>" | while read file;
do
sed -i -e 's/\<swake_up\>/&_one/g'
-e 's/\<swait_event[^ (]*/&_exclusive/g'
-e 's/\<prepare_to_swait\>/&_exclusive/g' $file;
done

With a few manual touch-ups.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: bigeasy@linutronix.de
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180612083909.261946548@infradead.org


# f64c6013 22-May-2018 Peter Zijlstra <peterz@infradead.org>

rcu/x86: Provide early rcu_cpu_starting() callback

The x86/mtrr code does horrific things because hardware. It uses
stop_machine_from_inactive_cpu(), which does a wakeup (of the stopper
thread on another CPU), which uses RCU, all before the CPU is onlined.

RCU complains about this, because wakeups use RCU and RCU does
(rightfully) not consider offline CPUs for grace-periods.

Fix this by initializing RCU way early in the MTRR case.

Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Add !SMP support, per 0day Test Robot report. ]


# a458360a 19-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Drop early GP request check from rcu_gp_kthread()

Now that grace-period requests use funnel locking and now that they
set ->gp_flags to RCU_GP_FLAG_INIT even when the RCU grace-period
kthread has not yet started, rcu_gp_kthread() no longer needs to check
need_any_future_gp() at startup time. This commit therefore removes
this check.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# c1935209 12-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Simplify and inline cpu_needs_another_gp()

Now that RCU no longer relies on failsafe checks, cpu_needs_another_gp()
can be greatly simplified. This simplification eliminates the last
call to rcu_future_needs_gp() and to rcu_segcblist_future_gp_needed(),
both of which which can then be eliminated. And then, because
cpu_needs_another_gp() is called only from __rcu_pending(), it can be
inlined and eliminated.

This commit carries out the simplification, inlining, and elimination
called out above.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# 384f77f4 12-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: The rcu_gp_cleanup() function does not need cpu_needs_another_gp()

All of the cpu_needs_another_gp() function's checks (except for
newly arrived callbacks) have been subsumed into the rcu_gp_cleanup()
function's scan of the rcu_node tree. This commit therefore drops the
call to cpu_needs_another_gp(). The check for newly arrived callbacks
is supplied by rcu_accelerate_cbs(). Any needed advancing (as in the
earlier rcu_advance_cbs() call) will be supplied when the corresponding
CPU becomes aware of the end of the now-completed grace period.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# 665f08f1 19-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_start_this_gp() check for out-of-range requests

If rcu_start_this_gp() is invoked with a requested grace period more
than three in the future, then either the ->need_future_gp[] array
needs to be bigger or the caller needs to be repaired. This commit
therefore adds a WARN_ON_ONCE() checking for this condition.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# 360e0da6 12-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Add funnel locking to rcu_start_this_gp()

The rcu_start_this_gp() function had a simple form of funnel locking that
used only the leaves and root of the rcu_node tree, which is fine for
systems with only a few hundred CPUs, but sub-optimal for systems having
thousands of CPUs. This commit therefore adds full-tree funnel locking.

This variant of funnel locking is unusual in the following ways:

1. The leaf-level rcu_node structure's ->lock is held throughout.
Other funnel-locking implementations drop the leaf-level lock
before progressing to the next level of the tree.

2. Funnel locking can be started at the root, which is convenient
for code that already holds the root rcu_node structure's ->lock.
Other funnel-locking implementations start at the leaves.

3. If an rcu_node structure other than the initial one believes
that a grace period is in progress, it is not necessary to
go further up the tree. This is because grace-period cleanup
scans the full tree, so that marking the need for a subsequent
grace period anywhere in the tree suffices -- but only if
a grace period is currently in progress.

4. It is possible that the RCU grace-period kthread has not yet
started, and this case must be handled appropriately.

However, the general approach of using a tree to control lock contention
is still in place.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# 41e80595 12-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_start_future_gp() caller select grace period

The rcu_accelerate_cbs() function selects a grace-period target, which
it uses to have rcu_segcblist_accelerate() assign numbers to recently
queued callbacks. Then it invokes rcu_start_future_gp(), which selects
a grace-period target again, which is a bit pointless. This commit
therefore changes rcu_start_future_gp() to take the grace-period target as
a parameter, thus avoiding double selection. This commit also changes
the name of rcu_start_future_gp() to rcu_start_this_gp() to reflect
this change in functionality, and also makes a similar change to the
name of trace_rcu_future_gp().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# d5cd9685 12-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Inline rcu_start_gp_advanced() into rcu_start_future_gp()

The rcu_start_gp_advanced() is invoked only from rcu_start_future_gp() and
much of its code is redundant when invoked from that context. This commit
therefore inlines rcu_start_gp_advanced() into rcu_start_future_gp(),
then removes rcu_start_gp_advanced().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# a824a287 19-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Clear request other than RCU_GP_FLAG_INIT at GP end

Once the grace period has ended, any RCU_GP_FLAG_FQS requests are
irrelevant: The grace period has ended, so there is no longer any
point in forcing quiescent states in order to try to make it end sooner.
This commit therefore causes rcu_gp_cleanup() to clear any bits other
than RCU_GP_FLAG_INIT from ->gp_flags at the end of the grace period.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# a508aa59 19-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Cleanup, don't put ->completed into an int

It is true that currently only the low-order two bits are used, so
there should be no problem given modern machines and compilers, but
good hygiene and maintainability dictates use of an unsigned long
instead of an int. This commit therefore makes this change.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# bd7af846 11-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Switch __rcu_process_callbacks() to rcu_accelerate_cbs()

The __rcu_process_callbacks() function currently checks to see if
the current CPU needs a grace period and also if there is any other
reason to kick off a new grace period. This is one of the fail-safe
checks that has been rendered unnecessary by the changes that increase
the accuracy of rcu_gp_cleanup()'s estimate as to whether another grace
period is required. Because this particular fail-safe involved acquiring
the root rcu_node structure's ->lock, which has seen excessive contention
in real life, this fail-safe needs to go.

However, one check must remain, namely the check for newly arrived
RCU callbacks that have not yet been associated with a grace period.
One might hope that the checks in __note_gp_changes(), which is invoked
indirectly from rcu_check_quiescent_state(), would suffice, but this
function won't be invoked at all if RCU is idle. It is therefore necessary
to replace the fail-safe checks with a simpler check for newly arrived
callbacks during an RCU idle period, which is exactly what this commit
does. This change removes the final call to rcu_start_gp(), so this
function is removed as well.

Note that lockless use of cpu_needs_another_gp() is racy, but that
these races are harmless in this case. If RCU really is idle, the
values will not change, so the return value from cpu_needs_another_gp()
will be correct. If RCU is not idle, the resulting redundant call to
rcu_accelerate_cbs() will be harmless, and might even have the benefit
of reducing grace-period latency a bit.

This commit also moves interrupt disabling into the "if" statement to
improve real-time response a bit.

Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# a6058d85 11-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Avoid __call_rcu_core() root rcu_node ->lock acquisition

When __call_rcu_core() notices excessive numbers of callbacks pending
on the current CPU, we know that at least one of them is not yet
classified, namely the one that was just now queued. Therefore, it
is not necessary to invoke rcu_start_gp() and thus not necessary to
acquire the root rcu_node structure's ->lock. This commit therefore
replaces the rcu_start_gp() with rcu_accelerate_cbs(), thus replacing
an acquisition of the root rcu_node structure's ->lock with that of
this CPU's leaf rcu_node structure.

This decreases contention on the root rcu_node structure's ->lock.

Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# ec4eacce 22-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_migrate_callbacks wake GP kthread when needed

The rcu_migrate_callbacks() function invokes rcu_advance_cbs()
twice, ignoring the return value. This is OK at pressent because of
failsafe code that does the wakeup when needed. However, this failsafe
code acquires the root rcu_node structure's lock frequently, while
rcu_migrate_callbacks() does so only once per CPU-offline operation.

This commit therefore makes rcu_migrate_callbacks()
wake up the RCU GP kthread when either call to rcu_advance_cbs()
returns true, thus removing need for the failsafe code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# 6f576e28 18-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Convert ->need_future_gp[] array to boolean

There is no longer any need for ->need_future_gp[] to count the number of
requests for future grace periods, so this commit converts the additions
to assignments to "true" and reduces the size of each element to one byte.
While we are in the area, fix an obsolete comment.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# 0ae94e00 18-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_future_needs_gp() check all ->need_future_gps[] elements

Currently, the rcu_future_needs_gp() function checks only the current
element of the ->need_future_gps[] array, which might miss elements that
were offset from the expected element, for example, due to races with
the start or the end of a grace period. This commit therefore makes
rcu_future_needs_gp() use the need_any_future_gp() macro to check all
of the elements of this array.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# fb31340f 12-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_gp_cleanup() more accurately predict need for new GP

Currently, rcu_gp_cleanup() scans the rcu_node tree in order to reset
state to reflect the end of the grace period. It also checks to see
whether a new grace period is needed, but in a number of cases, rather
than directly cause the new grace period to be immediately started, it
instead leaves the grace-period-needed state where various fail-safes
can find it. This works fine, but results in higher contention on the
root rcu_node structure's ->lock, which is undesirable, and contention
on that lock has recently become noticeable.

This commit therefore makes rcu_gp_cleanup() immediately start a new
grace period if there is any need for one.

It is quite possible that it will later be necessary to throttle the
grace-period rate, but that can be dealt with when and if.

Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# 5fe0a562 11-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_gp_kthread() check for early-boot activity

The rcu_gp_kthread() function immediately sleeps waiting to be notified
of the need for a new grace period, which currently works because there
are a number of code sequences that will provide the needed wakeup later.
However, some of these code sequences need to acquire the root rcu_node
structure's ->lock, and contention on that lock has started manifesting.
This commit therefore makes rcu_gp_kthread() check for early-boot activity
when it starts up, omitting the initial sleep in that case.

Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# c91a8675 18-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Add accessor macros for the ->need_future_gp[] array

Accessors for the ->need_future_gp[] array are currently open-coded,
which makes them difficult to change. To improve maintainability, this
commit adds need_future_gp_mask() to compute the indexing mask from the
array size, need_future_gp_element() to access the element corresponding
to the specified grace-period number, and need_any_future_gp() to
determine if any future grace period is needed. This commit also applies
need_future_gp_element() to existing open-coded single-element accesses.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# 825a9911 11-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_start_future_gp()'s grace-period check more precise

The rcu_start_future_gp() function uses a sloppy check for a grace
period being in progress, which works today because there are a number
of code sequences that resolve the resulting races. However, some of
these race-resolution code sequences must acquire the root rcu_node
structure's ->lock, and contention on that lock has started manifesting.
This commit therefore makes rcu_start_future_gp() check more precise,
eliminating the sloppy lockless check of the rcu_state structure's ->gpnum
and ->completed fields. The effect is that rcu_start_future_gp() will
sometimes unnecessarily attempt to start a new grace period, but this
overhead will be reduced later using funnel locking.

Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# 9036c2ff 10-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Improve non-root rcu_cbs_completed() accuracy

When rcu_cbs_completed() is invoked on a non-root rcu_node structure,
it unconditionally assumes that two grace periods must complete before
the callbacks at hand can be invoked. This is overly conservative because
if that non-root rcu_node structure believes that no grace period is in
progress, and if the corresponding rcu_state structure's ->gpnum field
has not yet been incremented, then these callbacks may safely be invoked
after only one grace period has completed.

This change is required to permit grace-period start requests to use
funnel locking, which is in turn permitted to reduce root rcu_node ->lock
contention, which has been observed by Nick Piggin. Furthermore, such
contention will likely be increased by the merging of RCU-bh, RCU-preempt,
and RCU-sched, so it makes sense to take steps to decrease it.

This commit therefore improves the accuracy of rcu_cbs_completed() when
invoked on a non-root rcu_node structure as described above.

Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# 5b4c11d5 13-Apr-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Add leaf-node macros

This commit adds rcu_first_leaf_node() that returns a pointer to
the first leaf rcu_node structure in the specified RCU flavor and an
rcu_is_leaf_node() that returns true iff the specified rcu_node structure
is a leaf. This commit also uses these macros where appropriate.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# cee43939 02-Mar-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Rename cond_resched_rcu_qs() to cond_resched_tasks_rcu_qs()

Commit e31d28b6ab8f ("trace: Eliminate cond_resched_rcu_qs() in favor
of cond_resched()") substituted cond_resched() for the earlier call
to cond_resched_rcu_qs(). However, the new-age cond_resched() does
not do anything to help RCU-tasks grace periods because (1) RCU-tasks
is only enabled when CONFIG_PREEMPT=y and (2) cond_resched() is a
complete no-op when preemption is enabled. This situation results
in hangs when running the trace benchmarks.

A number of potential fixes were discussed on LKML
(https://lkml.kernel.org/r/20180224151240.0d63a059@vmware.local.home),
including making cond_resched() not be a no-op; making cond_resched()
not be a no-op, but only when running tracing benchmarks; reverting
the aforementioned commit (which works because cond_resched_rcu_qs()
does provide an RCU-tasks quiescent state; and adding a call to the
scheduler/RCU rcu_note_voluntary_context_switch() function. All were
deemed unsatisfactory, either due to added cond_resched() overhead or
due to magic functions inviting cargo culting.

This commit renames cond_resched_rcu_qs() to cond_resched_tasks_rcu_qs(),
which provides a clear hint as to what this function is doing and
why and where it should be used, and then replaces the call to
cond_resched() with cond_resched_tasks_rcu_qs() in the trace benchmark's
benchmark_event_kthread() function.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# 25f3d7ef 01-Feb-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Parallelize expedited grace-period initialization

The latency of RCU expedited grace periods grows with increasing numbers
of CPUs, eventually failing to be all that expedited. Much of the growth
in latency is in the initialization phase, so this commit uses workqueues
to carry out this initialization concurrently on a rcu_node-by-rcu_node
basis.

This change makes use of a new rcu_par_gp_wq because flushing a work
item from another work item running from the same workqueue can result
in deadlock.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>


# ad7c946b 08-Jan-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Create RCU-specific workqueues with rescuers

RCU's expedited grace periods can participate in out-of-memory deadlocks
due to all available system_wq kthreads being blocked and there not being
memory available to create more. This commit prevents such deadlocks
by allocating an RCU-specific workqueue_struct at early boot time, and
providing it with a rescuer to ensure forward progress. This uses the
shiny new init_rescuer() function provided by Tejun (but indirectly).

This commit also causes SRCU to use this new RCU-specific
workqueue_struct. Note that SRCU's use of workqueues never blocks them
waiting for readers, so this should be safe from a forward-progress
viewpoint. Note that this moves SRCU from system_power_efficient_wq
to a normal workqueue. In the unlikely event that this results in
measurable degradation, a separate power-efficient workqueue will be
creates for SRCU.

Reported-by: Prateek Sood <prsood@codeaurora.org>
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>


# a32e01ee 17-Jan-2018 Matthew Wilcox <willy@infradead.org>

rcu: Use wrapper for lockdep asserts

Commits c0b334c5bfa9 and ea9b0c8a26a2 introduced new sparse warnings
by accessing rcu_node->lock directly and ignoring the __private
marker. Introduce a new wrapper and use it. Also fix a similar problem
in srcutree.c introduced by a3883df3935e.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d07aee2c 11-Jan-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: More clearly identify grace-period kthread stack dump

It is not always obvious that the stack dump from a starved grace-period
kthread isn't instead that of a CPU stalling the current grace period.
This commit therefore adds a pr_err() flagging these dumps.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d62df573 10-Jan-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove obsolete force-quiescent-state statistics for debugfs

The debugfs interface displayed statistics on RCU-pending checks but
this interface has since been removed. This commit therefore removes the
no-longer-used rcu_state structure's ->n_force_qs_lh and ->n_force_qs_ngp
fields along with their updates. (Though the ->n_force_qs_ngp field
was actually not used at all, embarrassingly enough.)

If this information proves necessary in the future, the corresponding
event traces will be added.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 01c495f7 10-Jan-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove obsolete __rcu_pending() statistics for debugfs

The debugfs interface displayed statistics on RCU-pending checks
but this interface has since been removed. This commit therefore
removes the no-longer-used rcu_data structure's ->n_rcu_pending,
->n_rp_core_needs_qs, ->n_rp_report_qs, ->n_rp_cb_ready,
->n_rp_cpu_needs_gp, ->n_rp_gp_completed, ->n_rp_gp_started,
->n_rp_nocb_defer_wakeup, and ->n_rp_need_nothing fields along with
their updates.

If this information proves necessary in the future, the corresponding
event traces will be added.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 62df63e0 10-Jan-2018 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove obsolete callback-invocation statistics for debugfs

The debugfs interface displayed statistics on RCU callback invocation but
this interface has since been removed. This commit therefore removes the
no-longer-used rcu_data structure's ->n_cbs_invoked and ->n_nocbs_invoked
fields along with their updates.

If this information proves necessary in the future, the corresponding
event traces will be added.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 398953e6 22-Nov-2017 Lihao Liang <lianglihao@huawei.com>

rcu: Remove unnecessary spinlock in rcu_boot_init_percpu_data()

Since rcu_boot_init_percpu_data() is only called at boot time,
there is no data race and spinlock is not needed.

Signed-off-by: Lihao Liang <lianglihao@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# efd88b02 19-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Add comment giving debug strategy for double call_rcu()

The following statement has for some reason proven non-intuitive:

WARN_ON_ONCE(rcu_segcblist_empty(&rdp->cblist) != (count == 0));

This commit therefore adds a comment that states that this warning
usually triggers in response to a double call_rcu(), which is sort
of like a double free. The comment also suggests building with
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y to track down the double call_rcu().

Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# e68bbb26 05-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Simplify rcu_eqs_{enter,exit}() non-idle task debug code

The code that checks for non-idle non-nohz_idle-usermode tasks invoking
rcu_eqs_enter() and rcu_eqs_exit() prints a considerable quantity of
helpful information. However, these checks fire rarely, so the extra
complexity is no longer worth it. This commit therefore replaces this
debug code with simple WARN_ON_ONCE() statements.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 9dd238e2 05-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Fold rcu_eqs_exit_common() into rcu_eqs_exit()

There is now only one call to rcu_eqs_exit_common() and there is no other
reason to keep it separate. This commit therefore inlines it into its
sole call site, saving a few lines of code in the process.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 215bba9f 05-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Fold rcu_eqs_enter_common() into rcu_eqs_enter()

There is now only one call to rcu_eqs_enter_common() and there is no other
reason to keep it separate. This commit therefore inlines it into its
sole call site, saving a few lines of code in the process.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 2342172f 05-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Avoid ->dynticks_nesting store tearing

Although ->dynticks_nesting is updated only by process level, it is
accessed from hardirq to check for interrupt-from-idle quiescent states.
Store tearing is thus possible, so this commit applies WRITE_ONCE()
to ->dynticks_nesting stores.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 914955e1 05-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Stop duplicating lockdep checks in RCU's idle-entry code

The three RCU_LOCKDEP_WARN() calls in rcu_eqs_enter_common() are
redundant with other lockdep checks, so this commit removes them.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# dec98900 04-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Add ->dynticks field to rcu_dyntick trace event

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 84585aa8 04-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Shrink ->dynticks_{nmi_,}nesting from long long to long

Because the ->dynticks_nesting field now only contains the process-based
nesting level instead of a value encoding both the process nesting level
and the irq "nesting" level, we no longer need a long long, even on
32-bit systems. This commit therefore changes both the ->dynticks_nesting
and ->dynticks_nmi_nesting fields to long.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# bd2b879a 04-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Add tracing to irq/NMI dyntick-idle transitions

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 844ccdd7 03-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate rcu_irq_enter_disabled()

Now that the irq path uses the rcu_nmi_{enter,exit}() algorithm,
rcu_irq_enter() and rcu_irq_exit() may be used from any context. There is
thus no need for rcu_irq_enter_disabled() and for the checks using it.
This commit therefore eliminates rcu_irq_enter_disabled().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 51a1fd30 03-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Make ->dynticks_nesting be a simple counter

Now that ->dynticks_nesting counts only process-level dyntick-idle
entry and exit, there is no need for the elaborate segmented counter
with its guard fields and overflow checking. This commit therefore
makes ->dynticks_nesting be a simple counter.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 58721f5d 03-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Define rcu_irq_{enter,exit}() in terms of rcu_nmi_{enter,exit}()

RCU currently uses two different mechanisms for tracking irqs and NMIs.
This is unnecessary complexity: Given that NMIs can nest and given that
RCU's tracking handles such nesting, the NMI tracking mechanism can also
be used to track irqs. This commit therefore defines rcu_irq_enter()
in terms of rcu_nmi_enter() and rcu_irq_exit() in terms of rcu_nmi_exit().

Unfortunately, callers must still distinguish between the irq and NMI
functions because additional actions are taken when an irq interrupts
idle or nohz_full usermode execution, and these actions cannot always
be taken from NMI handlers.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 6136d6e4 03-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Clamp ->dynticks_nmi_nesting at eqs entry/exit

In preparation for merging dyntick-idle irq handling into the NMI
algorithm, clamp ->dynticks_nmi_nesting value to allow for interrupts
that enter but never leave and vice versa.

It is important that the clamping happen outside of the extended quiescent
state. Otherwise, there will be short windows where irqs and NMIs fail
to convince RCU to start watching.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# fd581a91 02-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Move rcu_nmi_{enter,exit}() to prepare for consolidation

This is a code-motion-only commit that prepares to define rcu_irq_enter()
in terms of rcu_nmi_enter() and rcu_irq_exit() in terms of rcu_irq_exit().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a0eb22bf 02-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Reduce dyntick-idle state space

Both extended-quiescent-state entry and exit first update the nesting
counter and then adjust the dyntick-idle state. This means that there
are four states: (1) Both nesting and dyntick idle indicate idle,
(2) Nesting indicates idle but dyntick idle does not, (3) Nesting indicates
non-idle and dyntick idle does not, and (4) Both nesting and dyntick
idle indicate non-idle. This commit simplifies the state space by
eliminating #3, reversing the order of updates on exit from extended
quiescent state.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 2e672ab2 02-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Avoid ->dynticks_nmi_nesting store tearing

NMIs can nest, and store tearing could in theory happen on carries
from one byte to the next. This commit therefore adds the WRITE_ONCE()
macros preventing this.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b04db8e1 06-Nov-2017 Frederic Weisbecker <frederic@kernel.org>

rcu: Use lockdep to assert IRQs are disabled/enabled

Lockdep now has an integrated IRQs disabled/enabled sanity check. Just
use it instead of the ad-hoc RCU version.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1509980490-4285-15-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 27fdb35f 19-Oct-2017 Paul E. McKenney <paulmck@kernel.org>

doc: Fix various RCU docbook comment-header problems

Because many of RCU's files have not been included into docbook, a
number of errors have accumulated. This commit fixes them.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c0da313e 22-Sep-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Add extended-quiescent-state testing advice

If you add or remove calls to rcu_idle_enter(), rcu_user_enter(),
rcu_irq_exit(), rcu_irq_exit_irqson(), rcu_idle_exit(), rcu_user_exit(),
rcu_irq_enter(), rcu_irq_enter_irqson(), rcu_nmi_enter(), or
rcu_nmi_exit(), you should run a full set of tests on a kernel built
with CONFIG_RCU_EQS_DEBUG=y.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 9b9500da 17-Aug-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Make RCU CPU stall warnings check for irq-disabled CPUs

One common question upon seeing an RCU CPU stall warning is "did
the stalled CPUs have interrupts disabled?" However, the current
stall warnings are silent on this point. This commit therefore
uses irq_work to check whether stalled CPUs still respond to IPIs,
and flags this state in the RCU CPU stall warning console messages.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# f79c3ad6 30-Nov-2016 Paul E. McKenney <paulmck@kernel.org>

sched,rcu: Make cond_resched() provide RCU quiescent state

There is some confusion as to which of cond_resched() or
cond_resched_rcu_qs() should be added to long in-kernel loops.
This commit therefore eliminates the decision by adding RCU quiescent
states to cond_resched(). This commit also simplifies the code that
used to interact with cond_resched_rcu_qs(), and that now interacts with
cond_resched(), to reduce its overhead. This reduction is necessary to
allow the heavier-weight cond_resched_rcu_qs() mechanism to be invoked
everywhere that cond_resched() is invoked.

Part of that reduction in overhead converts the jiffies_till_sched_qs
kernel parameter to read-only at runtime, thus eliminating the need for
bounds checking.

Reported-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
[ paulmck: Keep PREEMPT=n cond_resched a no-op, per Peter Zijlstra. ]


# f39b536c 25-Sep-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove extraneous READ_ONCE()s from rcu_irq_{enter,exit}()

The read of ->dynticks_nmi_nesting in rcu_irq_enter() and rcu_irq_exit()
is currently protected with READ_ONCE(). However, this protection is
unnecessary because (1) ->dynticks_nmi_nesting is updated only by the
current CPU, (2) Although NMI handlers can update this field, they reset
it back to its old value before return, and (3) Interrupts are disabled,
so nothing else can modify it. The value of ->dynticks_nmi_nesting is
thus effectively constant, and so no protection is required.

This commit therefore removes the READ_ONCE() protection from these
two accesses.

Link: http://lkml.kernel.org/r/20170926031902.GA2074@linux.vnet.ibm.com

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 28585a83 22-Sep-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Allow for page faults in NMI handlers

A number of architecture invoke rcu_irq_enter() on exception entry in
order to allow RCU read-side critical sections in the exception handler
when the exception is from an idle or nohz_full CPU. This works, at
least unless the exception happens in an NMI handler. In that case,
rcu_nmi_enter() would already have exited the extended quiescent state,
which would mean that rcu_irq_enter() would (incorrectly) cause RCU
to think that it is again in an extended quiescent state. This will
in turn result in lockdep splats in response to later RCU read-side
critical sections.

This commit therefore causes rcu_irq_enter() and rcu_irq_exit() to
take no action if there is an rcu_nmi_enter() in effect, thus avoiding
the unscheduled return to RCU quiescent state. This in turn should
make the kernel safe for on-demand RCU voyeurism.

Link: http://lkml.kernel.org/r/20170922211022.GA18084@linux.vnet.ibm.com

Cc: stable@vger.kernel.org
Fixes: 0be964be0 ("module: Sanitize RCU usage and locking")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 9b130ad5 08-Sep-2017 Alexey Dobriyan <adobriyan@gmail.com>

treewide: make "nr_cpu_ids" unsigned

First, number of CPUs can't be negative number.

Second, different signnnedness leads to suboptimal code in the following
cases:

1)
kmalloc(nr_cpu_ids * sizeof(X));

"int" has to be sign extended to size_t.

2)
while (loff_t *pos < nr_cpu_ids)

MOVSXD is 1 byte longed than the same MOV.

Other cases exist as well. Basically compiler is told that nr_cpu_ids
can't be negative which can't be deduced if it is "int".

Code savings on allyesconfig kernel: -3KB

add/remove: 0/0 grow/shrink: 25/264 up/down: 261/-3631 (-3370)
function old new delta
coretemp_cpu_online 450 512 +62
rcu_init_one 1234 1272 +38
pci_device_probe 374 399 +25

...

pgdat_reclaimable_pages 628 556 -72
select_fallback_rq 446 369 -77
task_numa_find_cpu 1923 1807 -116

Link: http://lkml.kernel.org/r/20170819114959.GA30580@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 16c0b106 12-Jul-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove exports from rcu_idle_exit() and rcu_idle_enter()

The rcu_idle_exit() and rcu_idle_enter() functions are exported because
they were originally used by RCU_NONIDLE(), which was intended to
be usable from modules. However, RCU_NONIDLE() now instead uses
rcu_irq_enter_irqson() and rcu_irq_exit_irqson(), which are not
exported, and there have been no complaints.

This commit therefore removes the exports from rcu_idle_exit() and
rcu_idle_enter().

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d4db30af 12-Jul-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Add warning to rcu_idle_enter() for irqs enabled

All current callers of rcu_idle_enter() have irqs disabled, and
rcu_idle_enter() relies on this, but doesn't check. This commit
therefore adds a RCU_LOCKDEP_WARN() to add some verification to the trust.
While we are there, pass "true" rather than "1" to rcu_eqs_enter().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 3a607992 12-Jul-2017 Peter Zijlstra (Intel) <peterz@infradead.org>

rcu: Make rcu_idle_enter() rely on callers disabling irqs

All callers to rcu_idle_enter() have irqs disabled, so there is no
point in rcu_idle_enter disabling them again. This commit therefore
replaces the irq disabling with a RCU_LOCKDEP_WARN().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 2dee9404 11-Jul-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Add assertions verifying blocked-tasks list

This commit adds assertions verifying the consistency of the rcu_node
structure's ->blkd_tasks list and its ->gp_tasks, ->exp_tasks, and
->boost_tasks pointers. In particular, the ->blkd_tasks lists must be
empty except for leaf rcu_node structures.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 35fe723b 27-Jun-2017 Masami Hiramatsu <mhiramat@kernel.org>

rcu/tracing: Set disable_rcu_irq_enter on rcu_eqs_exit()

Set disable_rcu_irq_enter on not only rcu_eqs_enter_common() but also
rcu_eqs_exit(), since rcu_eqs_exit() suffers from the same issue as was
fixed for rcu_eqs_enter_common() by commit 03ecd3f48e57 ("rcu/tracing:
Add rcu_disabled to denote when rcu_irq_enter() will not work").

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d8db2e86 27-Jun-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Add TPS() protection for _rcu_barrier_trace strings

The _rcu_barrier_trace() function is a wrapper for trace_rcu_barrier(),
which needs TPS() protection for strings passed through the second
argument. However, it has escaped prior TPS()-ification efforts because
it _rcu_barrier_trace() does not start with "trace_". This commit
therefore adds the needed TPS() protection

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# d5374226 20-Jun-2017 Luis R. Rodriguez <mcgrof@kernel.org>

rcu: Use idle versions of swait to make idle-hack clear

These RCU waits were set to use interruptible waits to avoid the kthreads
contributing to system load average, even though they are not interruptible
as they are spawned from a kthread. Use the new TASK_IDLE swaits which makes
our goal clear, and removes confusion about these paths possibly being
interruptible -- they are not.

When the system is idle the RCU grace-period kthread will spend all its time
blocked inside the swait_event_interruptible(). If the interruptible() was
not used, then this kthread would contribute to the load average. This means
that an idle system would have a load average of 2 (or 3 if PREEMPT=y),
rather than the load average of 0 that almost fifty years of UNIX has
conditioned sysadmins to expect.

The same argument applies to swait_event_interruptible_timeout() use. The
RCU grace-period kthread spends its time blocked inside this call while
waiting for grace periods to complete. In particular, if there was only one
busy CPU, but that CPU was frequently invoking call_rcu(), then the RCU
grace-period kthread would spend almost all its time blocked inside the
swait_event_interruptible_timeout(). This would mean that the load average
would be 2 rather than the expected 1 for the single busy CPU.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 09efeeee 19-Jul-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Move callback-list warning to irq-disable region

After adopting callbacks from a newly offlined CPU, the adopting CPU
checks to make sure that its callback list's count is zero only if the
list has no callbacks and vice versa. Unfortunately, it does so after
enabling interrupts, which means that false positives are possible due to
interrupt handlers invoking call_rcu(). Although these false positives
are improbable, rcutorture did make it happen once.

This commit therefore moves this check to an irq-disabled region of code,
thus suppressing the false positive.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# f2dbe4a5 27-Jun-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Localize rcu_state ->orphan_pend and ->orphan_done

Given that the rcu_state structure's >orphan_pend and ->orphan_done
fields are used only during migration of callbacks from the recently
offlined CPU to a surviving CPU, if rcu_send_cbs_to_orphanage() and
rcu_adopt_orphan_cbs() are combined, these fields can become local
variables in the combined function. This commit therefore combines
rcu_send_cbs_to_orphanage() and rcu_adopt_orphan_cbs() into a new
rcu_segcblist_merge() function and removes the ->orphan_pend and
->orphan_done fields.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 21cc2483 26-Jun-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Advance callbacks after migration

When migrating callbacks from a newly offlined CPU, we are already
holding the root rcu_node structure's lock, so it costs almost nothing
to advance and accelerate the newly migrated callbacks. This patch
therefore makes this advancing and acceleration happen.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 537b85c8 26-Jun-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate rcu_state ->orphan_lock

The ->orphan_lock is acquired and released only within the
rcu_migrate_callbacks() function, which now acquires the root rcu_node
structure's ->lock. This commit therefore eliminates the ->orphan_lock
in favor of the root rcu_node structure's ->lock.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 9fa46fb8 26-Jun-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Advance outgoing CPU's callbacks before migrating them

It is possible that the outgoing CPU is unaware of recent grace periods,
and so it is also possible that some of its pending callbacks are actually
ready to be invoked. The current callback-migration code would needlessly
force these callbacks to pass through another grace period. This commit
therefore invokes rcu_advance_cbs() on the outgoing CPU's callbacks in
order to give them full credit for having passed through any recent
grace periods.

This also fixes an odd theoretical bug where there are no callbacks in
the system except for those on the outgoing CPU, none of those callbacks
have yet been associated with a grace-period number, there is never again
another callback registered, and the surviving CPU never again takes a
scheduling-clock interrupt, never goes idle, and never enters nohz_full
userspace execution. Yes, this is (just barely) possible. It requires
that the surviving CPU be a nohz_full CPU, that its scheduler-clock
interrupt be shut off, and that it loop forever in the kernel. You get
bonus points if you can make this one happen! ;-)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b1a2d79f 26-Jun-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Make NOCB CPUs migrate CBs directly from outgoing CPU

RCU's CPU-hotplug callback-migration code first moves the outgoing
CPU's callbacks to ->orphan_done and ->orphan_pend, and only then
moves them to the NOCB callback list. This commit avoids the
extra step (and simplifies the code) by moving the callbacks directly
from the outgoing CPU's callback list to the NOCB callback list.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 95335c03 26-Jun-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Check for NOCB CPUs and empty lists earlier in CB migration

The current CPU-hotplug RCU-callback-migration code checks
for the source (newly offlined) CPU being a NOCBs CPU down in
rcu_send_cbs_to_orphanage(). This commit simplifies callback migration a
bit by moving this check up to rcu_migrate_callbacks(). This commit also
adds a check for the source CPU having no callbacks, which eases analysis
of the rcu_send_cbs_to_orphanage() and rcu_adopt_orphan_cbs() functions.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# c47e067a 23-Jun-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove orphan/adopt event-tracing fields

The rcu_node structure's ->n_cbs_orphaned and ->n_cbs_adopted fields
are updated, but never read. This commit therefore removes them.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 313517fc 08-Jun-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Make expedited GPs correctly handle hardware CPU insertion

The update of the ->expmaskinitnext and of ->ncpus are unsynchronized,
with the value of ->ncpus being incremented long before the corresponding
->expmaskinitnext mask is updated. If an RCU expedited grace period
sees ->ncpus change, it will update the ->expmaskinit masks from the new
->expmaskinitnext masks. But it is possible that ->ncpus has already
been updated, but the ->expmaskinitnext masks still have their old values.
For the current expedited grace period, no harm done. The CPU could not
have been online before the grace period started, so there is no need to
wait for its non-existent pre-existing readers.

But the next RCU expedited grace period is in a world of hurt. The value
of ->ncpus has already been updated, so this grace period will assume
that the ->expmaskinitnext masks have not changed. But they have, and
they won't be taken into account until the next never-been-online CPU
comes online. This means that RCU will be ignoring some CPUs that it
should be paying attention to.

The solution is to update ->ncpus and ->expmaskinitnext while holding
the ->lock for the rcu_node structure containing the ->expmaskinitnext
mask. Because smp_store_release() is now used to update ->ncpus and
smp_load_acquire() is now used to locklessly read it, if the expedited
grace period sees ->ncpus change, then the updating CPU has to
already be holding the corresponding ->lock. Therefore, when the
expedited grace period later acquires that ->lock, it is guaranteed
to see the new value of ->expmaskinitnext.

On the other hand, if the expedited grace period loads ->ncpus just
before an update, earlier full memory barriers guarantee that
the incoming CPU isn't far enough along to be running any RCU readers.

This commit therefore makes the required change.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a58163d8 20-Jun-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Migrate callbacks earlier in the CPU-offline timeline

RCU callbacks must be migrated away from an outgoing CPU, and this is
done near the end of the CPU-hotplug operation, after the outgoing CPU is
long gone. Unfortunately, this means that other CPU-hotplug callbacks
can execute while the outgoing CPU's callbacks are still immobilized
on the long-gone CPU's callback lists. If any of these CPU-hotplug
callbacks must wait, either directly or indirectly, for the invocation
of any of the immobilized RCU callbacks, the system will hang.

This commit avoids such hangs by migrating the callbacks away from the
outgoing CPU immediately upon its departure, shortly after the return
from __cpu_die() in takedown_cpu(). Thus, RCU is able to advance these
callbacks and invoke them, which allows all the after-the-fact CPU-hotplug
callbacks to wait on these RCU callbacks without risk of a hang.

While in the neighborhood, this commit also moves rcu_send_cbs_to_orphanage()
and rcu_adopt_orphan_cbs() under a pre-existing #ifdef to avoid including
dead code on the one hand and to avoid define-without-use warnings on the
other hand.

Reported-by: Jeffrey Hugo <jhugo@codeaurora.org>
Link: http://lkml.kernel.org/r/db9c91f6-1b17-6136-84f0-03c3c2581ab4@codeaurora.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Richard Weinberger <richard@nod.at>


# 96036c43 18-Jul-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Add last-CPU to GP-kthread starvation messages

This commit augments the grace-period-kthread starvation debugging
messages by adding the last CPU that ran the kthread.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# fe5ac724 11-May-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove nohz_full full-system-idle state machine

The NO_HZ_FULL_SYSIDLE full-system-idle capability was added in 2013
by commit 0edd1b1784cb ("nohz_full: Add full-system-idle state machine"),
but has not been used. This commit therefore removes it.

If it turns out to be needed later, this commit can always be reverted.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>


# f7a10a97 10-May-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove the RCU_KTHREAD_PRIO Kconfig option

Anything that can be done with the RCU_KTHREAD_PRIO Kconfig option can
also be done with the rcutree.kthread_prio kernel boot parameter.
This commit therefore removes this Kconfig option.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>


# 90040c9e 10-May-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove *_SLOW_* Kconfig options

The RCU_TORTURE_TEST_SLOW_PREINIT, RCU_TORTURE_TEST_SLOW_PREINIT_DELAY,
RCU_TORTURE_TEST_SLOW_PREINIT_DELAY, RCU_TORTURE_TEST_SLOW_INIT,
RCU_TORTURE_TEST_SLOW_INIT_DELAY, RCU_TORTURE_TEST_SLOW_CLEANUP,
and RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY Kconfig options are only
useful for torture testing, and there are the rcutree.gp_cleanup_delay,
rcutree.gp_init_delay, and rcutree.gp_preinit_delay kernel boot parameters
that rcutorture can use instead. The effect of these parameters is to
artificially slow down grace period initialization and cleanup in order
to make some types of race conditions happen more often.

This commit therefore simplifies Tree RCU a bit by removing the Kconfig
options and adding the corresponding kernel parameters to rcutorture's
.boot files instead. However, this commit also leaves out the kernel
parameters for TREE02, TREE04, and TREE07 in order to have about the
same number of tests slowed as not slowed. TREE01, TREE03, TREE05,
and TREE06 are slowed, and the rest are not slowed.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# fa3c6647 03-May-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Improve __call_rcu() debug-objects error message

The "__call_rcu(): Leaked duplicate callback" error message from
__call_rcu() has proven to be unhelpful. This commit therefore changes
it to "__call_rcu(): Double-freed CB" and adds the value of the pointer
passed in. The value of the pointer improves debuggability by allowing
correlation with tracing output, for example, the rcu:rcu_callback trace
event.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 791875d1 03-May-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate the unused __rcu_is_watching() function

The __rcu_is_watching() function is currently not used, aside from
to implement the rcu_is_watching() function. This commit therefore
eliminates __rcu_is_watching(), which has the beneficial side-effect
of shrinking include/linux/rcupdate.h a bit.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a68a2bb2 03-May-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Move docbook comments out of rcupdate.h

The include/linux/rcupdate.h file is included by more than 200
files, so shrinking it should provide some build-time benefits.
This commit therefore moves several docbook comments from rcupdate.h to
kernel/rcu/update.c, kernel/rcu/tree.c, and kernel/rcu/tree_plugin.h, thus
reducing the number of times that the compiler has to scan these comments.
This likely provides only a small benefit, but every little bit helps.

This commit also fixes a malformed bulleted list noted by the 0day
Test Robot.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# c0b334c5 28-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Add lockdep_assert_held() teeth to tree.c

Comments can be helpful, but assertions carry more force. This
commit therefore adds lockdep_assert_held() and RCU_LOCKDEP_WARN()
calls to enforce lock-held and interrupt-disabled preconditions.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 17c7798b 28-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Update rcu_bootup_announce_oddness()

This commit updates rcu_bootup_announce_oddness() to check additional
Kconfig options and module/boot parameters.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# f4687d26 27-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Add preemptibility checks in rcu_sched_qs() and rcu_bh_qs()

This commit adds WARN_ON_ONCE() calls that trigger if either
rcu_sched_qs() or rcu_bh_qs() are invoked with preemption enabled.
In the immortal words of Peter Zijlstra: "these are much harder to ignore
than comments".

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# e28371c8 17-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove obsolete reference to synchronize_kernel()

The synchronize_kernel() primitive was removed in favor of
synchronize_sched() more than a decade ago, and it seems likely that
rather few kernel hackers are familiar with it. Its continued presence
is therefore providing more confusion than enlightenment. This commit
therefore removes the reference from the synchronize_sched() header
comment, and adds the corresponding information to the synchronize_rcu(0
header comment.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 5b72f964 12-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Complain if blocking in preemptible RCU read-side critical section

Although preemptible RCU allows its read-side critical sections to be
preempted, general blocking is forbidden. The reason for this is that
excessive preemption times can be handled by CONFIG_RCU_BOOST=y, but a
voluntarily blocked task doesn't care how high you boost its priority.
Because preemptible RCU is a global mechanism, one ill-behaved reader
hurts everyone. Hence the prohibition against general blocking in
RCU-preempt read-side critical sections. Preemption yes, blocking no.

This commit enforces this prohibition.

There is a special exception for the -rt patchset (which they kindly
volunteered to implement): It is OK to block (as opposed to merely being
preempted) within an RCU-preempt read-side critical section, but only if
the blocking is subject to priority inheritance. This exception permits
CONFIG_RCU_BOOST=y to get -rt RCU readers out of trouble.

Why doesn't this exception also apply to mainline's rt_mutex? Because
of the possibility that someone does general blocking while holding
an rt_mutex. Yes, the priority boosting will affect the rt_mutex,
but it won't help with the task doing general blocking while holding
that rt_mutex.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# f92c734f 10-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Prevent rcu_barrier() from starting needless grace periods

Currently rcu_barrier() uses call_rcu() to enqueue new callbacks
on each CPU with a non-empty callback list. This works, but means
that rcu_barrier() forces grace periods that are not otherwise needed.
The key point is that rcu_barrier() never needs to wait for a grace
period, but instead only for all pre-existing callbacks to be invoked.
This means that rcu_barrier()'s new callbacks should be placed in
the callback-list segment containing the last pre-existing callback.

This commit makes this change using the new rcu_segcblist_entrain()
function.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 933dfbd7 02-May-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Open-code the rcu_cblist_n_lazy_cbs() function

Because the rcu_cblist_n_lazy_cbs() just samples the ->len_lazy counter,
and because the rcu_cblist structure is quite straightforward, it makes
sense to open-code rcu_cblist_n_lazy_cbs(p) as p->len_lazy, cutting out
a level of indirection. This commit makes this change.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>


# 4b27f20b 02-May-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Open-code the rcu_cblist_n_cbs() function

Because the rcu_cblist_n_cbs() just samples the ->len counter, and
because the rcu_cblist structure is quite straightforward, it makes
sense to open-code rcu_cblist_n_cbs(p) as p->len, cutting out a level
of indirection. This commit makes this change.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>


# 8ef0f37e 02-May-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Open-code the rcu_cblist_empty() function

Because the rcu_cblist_empty() just samples the ->head pointer, and
because the rcu_cblist structure is quite straightforward, it makes
sense to open-code rcu_cblist_empty(p) as !p->head, cutting out a
level of indirection. This commit makes this change.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>


# 7f6733c3 18-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Make rcutorture writer stalls print SRCU GP state

In the past, SRCU was simple enough that there was little point in
making the rcutorture writer stall messages print the SRCU grace-period
number state. With the advent of Tree SRCU, this has changed. This
commit therefore makes Classic, Tiny, and Tree SRCU report this state
to rcutorture as needed.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>


# bcbfdd01 11-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Make non-preemptive schedule be Tasks RCU quiescent state

Currently, a call to schedule() acts as a Tasks RCU quiescent state
only if a context switch actually takes place. However, just the
call to schedule() guarantees that the calling task has moved off of
whatever tracing trampoline that it might have been one previously.
This commit therefore plumbs schedule()'s "preempt" parameter into
rcu_note_context_switch(), which then records the Tasks RCU quiescent
state, but only if this call to schedule() was -not- due to a preemption.

To avoid adding overhead to the common-case context-switch path,
this commit hides the rcu_note_context_switch() check under an existing
non-common-case check.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# da915ad5 05-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Parallelize callback handling

Peter Zijlstra proposed using SRCU to reduce mmap_sem contention [1,2],
however, there are workloads that could result in a high volume of
concurrent invocations of call_srcu(), which with current SRCU would
result in excessive lock contention on the srcu_struct structure's
->queue_lock, which protects SRCU's callback lists. This commit therefore
moves SRCU to per-CPU callback lists, thus greatly reducing contention.

Because a given SRCU instance no longer has a single centralized callback
list, starting grace periods and invoking callbacks are both more complex
than in the single-list Classic SRCU implementation. Starting grace
periods and handling callbacks are now handled using an srcu_node tree
that is in some ways similar to the rcu_node trees used by RCU-bh,
RCU-preempt, and RCU-sched (for example, the srcu_node tree shape is
controlled by exactly the same Kconfig options and boot parameters that
control the shape of the rcu_node tree).

In addition, the old per-CPU srcu_array structure is now named srcu_data
and contains an rcu_segcblist structure named ->srcu_cblist for its
callbacks (and a spinlock to protect this). The srcu_struct gets
an srcu_gp_seq that is used to associate callback segments with the
corresponding completion-time grace-period number. These completion-time
grace-period numbers are propagated up the srcu_node tree so that the
grace-period workqueue handler can determine whether additional grace
periods are needed on the one hand and where to look for callbacks that
are ready to be invoked.

The srcu_barrier() function must now wait on all instances of the per-CPU
->srcu_cblist. Because each ->srcu_cblist is protected by ->lock,
srcu_barrier() can remotely add the needed callbacks. In theory,
it could also remotely start grace periods, but in practice doing so
is complex and racy. And interestingly enough, it is never necessary
for srcu_barrier() to start a grace period because srcu_barrier() only
enqueues a callback when a callback is already present--and it turns out
that a grace period has to have already been started for this pre-existing
callback. Furthermore, it is only the callback that srcu_barrier()
needs to wait on, not any particular grace period. Therefore, a new
rcu_segcblist_entrain() function enqueues the srcu_barrier() function's
callback into the same segment occupied by the last pre-existing callback
in the list. The special case where all the pre-existing callbacks are
on a different list (because they are in the process of being invoked)
is handled by enqueuing srcu_barrier()'s callback into the RCU_DONE_TAIL
segment, relying on the done-callbacks check that takes place after all
callbacks are inovked.

Note that the readers use the same algorithm as before. Note that there
is a separate srcu_idx that tells the readers what counter to increment.
This unfortunately cannot be combined with srcu_gp_seq because they
need to be incremented at different times.

This commit introduces some ugly #ifdefs in rcutorture. These will go
away when I feel good enough about Tree SRCU to ditch Classic SRCU.

Some crude performance comparisons, courtesy of a quickly hacked rcuperf
asynchronous-grace-period capability:

Callback Queuing Overhead
-------------------------
# CPUS Classic SRCU Tree SRCU
------ ------------ ---------
2 0.349 us 0.342 us
16 31.66 us 0.4 us
41 --------- 0.417 us

The times are the 90th percentiles, a statistic that was chosen to reject
the overheads of the occasional srcu_barrier() call needed to avoid OOMing
the test machine. The rcuperf test hangs when running Classic SRCU at 41
CPUs, hence the line of dashes. Despite the hacks to both the rcuperf code
and that statistics, this is a convincing demonstration of Tree SRCU's
performance and scalability advantages.

[1] https://lwn.net/Articles/309030/
[2] https://patchwork.kernel.org/patch/5108281/

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix initialization if synchronize_srcu_expedited() called first. ]


# bfd090be 08-Feb-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix typo in PER_RCU_NODE_PERIOD header comment

This commit just changes a "the the" to "the" to reduce repetition.

Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 50dc7def 25-Mar-2017 Nicholas Mc Guire <der.herr@hofr.at>

rcu: Use bool value directly

The beenonline variable is declared bool so there is no need for an
explicit comparison, especially not against the constant zero.

Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# deb34f36 23-Mar-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Improve comments for hotplug/suspend/hibernate functions

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d1e4f01d 08-Feb-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove obsolete comment from rcu_future_gp_cleanup() header

The rcu_nocb_gp_cleanup() function is now invoked elsewhere, so this
commit drags this comment into the year 2017.

Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# e95d68d2 15-Mar-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Make num_rcu_lvl[] array be external

This commit makes the num_rcu_lvl[] array external so that SRCU can
make use of it for initializing its upcoming srcu_node tree.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 41f5c631 15-Mar-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove redundant levelcnt[] array from rcu_init_one()

The levelcnt[] array is identical to num_rcu_lvl[], so this commit
removes levelcnt[].

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 2b34c43c 14-Mar-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Move rcu_init_levelspread() to rcu_tree_node.h

This commit moves the rcu_init_levelspread() function from
kernel/rcu/tree.c to kernel/rcu/rcu.h so that SRCU can access it. This is
another step towards enabling SRCU to create its own combining tree.
This commit is code-movement only, give or take knock-on adjustments.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 2e8c28c2 20-Feb-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Move rcu_seq_start() and friends to rcu.h

This commit moves rcu_seq_start(), rcu_seq_end(), rcu_seq_snap(),
and rcu_seq_done() from kernel/rcu/tree.c to kernel/rcu/rcu.h.
This will allow SRCU to use these functions, which in turn will
allow SRCU to move from a single global callback queue to a
per-CPU callback queue.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 900b1028 10-Feb-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Allow SRCU to access rcu_scheduler_active

This is primarily a code-movement commit in preparation for allowing
SRCU to handle early-boot SRCU grace periods.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 15fecf89 08-Feb-2017 Paul E. McKenney <paulmck@kernel.org>

srcu: Abstract multi-tail callback list handling

RCU has only one multi-tail callback list, which is implemented via
the nxtlist, nxttail, nxtcompleted, qlen_lazy, and qlen fields in the
rcu_data structure, and whose operations are open-code throughout the
Tree RCU implementation. This has been more or less OK in the past,
but upcoming callback-list optimizations in SRCU could really use
a multi-tail callback list there as well.

This commit therefore abstracts the multi-tail callback list handling
into a new kernel/rcu/rcu_segcblist.h file, and uses this new API.
The simple head-and-tail pointer callback list is also abstracted and
applied everywhere except for the NOCB callback-offload lists. (Yes,
the plan is to apply them there as well, but this commit is already
bigger than would be good.)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 9226b10d 27-Jan-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Place guard on rcu_all_qs() and rcu_note_context_switch() actions

The rcu_all_qs() and rcu_note_context_switch() do a series of checks,
taking various actions to supply RCU with quiescent states, depending
on the outcomes of the various checks. This is a bit much for scheduling
fastpaths, so this commit creates a separate ->rcu_urgent_qs field in
the rcu_dynticks structure that acts as a global guard for these checks.
Thus, in the common case, rcu_all_qs() and rcu_note_context_switch()
check the ->rcu_urgent_qs field, find it false, and simply return.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>


# 0f9be8ca 27-Jan-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate flavor scan in rcu_momentary_dyntick_idle()

The rcu_momentary_dyntick_idle() function scans the RCU flavors, checking
that one of them still needs a quiescent state before doing an expensive
atomic operation on the ->dynticks counter. However, this check reduces
overhead only after a rare race condition, and increases complexity. This
commit therefore removes the scan and the mechanism enabling the scan.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 9577df9a 26-Jan-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Pull rcu_qs_ctr into rcu_dynticks structure

The rcu_qs_ctr variable is yet another isolated per-CPU variable,
so this commit pulls it into the pre-existing rcu_dynticks per-CPU
structure.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# abb06b99 26-Jan-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Pull rcu_sched_qs_mask into rcu_dynticks structure

The rcu_sched_qs_mask variable is yet another isolated per-CPU variable,
so this commit pulls it into the pre-existing rcu_dynticks per-CPU
structure.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 88a4976d 23-Jan-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Semicolon inside RCU_TRACE() for tree.c

The current use of "RCU_TRACE(statement);" can cause odd bugs, especially
where "statement" is a local-variable declaration, as it can leave a
misplaced ";" in the source code. This commit therefore converts these
to "RCU_TRACE(statement;)", which avoids the misplaced ";".

Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b8c17e66 08-Nov-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Maintain special bits at bottom of ->dynticks counter

Currently, IPIs are used to force other CPUs to invalidate their TLBs
in response to a kernel virtual-memory mapping change. This works, but
degrades both battery lifetime (for idle CPUs) and real-time response
(for nohz_full CPUs), and in addition results in unnecessary IPIs due to
the fact that CPUs executing in usermode are unaffected by stale kernel
mappings. It would be better to cause a CPU executing in usermode to
wait until it is entering kernel mode to do the flush, first to avoid
interrupting usemode tasks and second to handle multiple flush requests
with a single flush in the case of a long-running user task.

This commit therefore reserves a bit at the bottom of the ->dynticks
counter, which is checked upon exit from extended quiescent states.
If it is set, it is cleared and then a new rcu_eqs_special_exit() macro is
invoked, which, if not supplied, is an empty single-pass do-while loop.
If this bottom bit is set on -entry- to an extended quiescent state,
then a WARN_ON_ONCE() triggers.

This bottom bit may be set using a new rcu_eqs_special_set() function,
which returns true if the bit was set, or false if the CPU turned
out to not be in an extended quiescent state. Please note that this
function refuses to set the bit for a non-nohz_full CPU when that CPU
is executing in usermode because usermode execution is tracked by RCU
as a dyntick-idle extended quiescent state only for nohz_full CPUs.

Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 03ecd3f4 06-Apr-2017 Steven Rostedt (VMware) <rostedt@goodmis.org>

rcu/tracing: Add rcu_disabled to denote when rcu_irq_enter() will not work

Tracing uses rcu_irq_enter() as a way to make sure that RCU is watching when
it needs to use rcu_read_lock() and friends. This is because tracing can
happen as RCU is about to enter user space, or about to go idle, and RCU
does not watch for RCU read side critical sections as it makes the
transition.

There is a small location within the RCU infrastructure that rcu_irq_enter()
itself will not work. If tracing were to occur in that section it will break
if it tries to use rcu_irq_enter().

Originally, this happens with the stack_tracer, because it will call
save_stack_trace when it encounters stack usage that is greater than any
stack usage it had encountered previously. There was a case where that
happened in the RCU section where rcu_irq_enter() did not work, and lockdep
complained loudly about it. To fix it, stack tracing added a call to be
disabled and RCU would disable stack tracing during the critical section
that rcu_irq_enter() was inoperable. This solution worked, but there are
other cases that use rcu_irq_enter() and it would be a good idea to let RCU
give a way to let others know that rcu_irq_enter() will not work. For
example, in trace events.

Another helpful aspect of this change is that it also moves the per cpu
variable called in the RCU critical section into a cache locale along with
other RCU per cpu variables used in that same location.

I'm keeping the stack_trace_disable() code, as that still could be used in
the future by places that really need to disable it. And since it's only a
static inline, it wont take up any kernel text if it is not used.

Link: http://lkml.kernel.org/r/20170405093207.404f8deb@gandalf.local.home

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# a278d471 05-Apr-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix dyntick-idle tracing

The tracing subsystem started using rcu_irq_entry() and rcu_irq_exit()
(with my blessing) to allow the current _rcuidle alternative tracepoint
name to be dispensed with while still maintaining good performance.
Unfortunately, this causes RCU's dyntick-idle entry code's tracing to
appear to RCU like an interrupt that occurs where RCU is not designed
to handle interrupts.

This commit fixes this problem by moving the zeroing of ->dynticks_nesting
after the offending trace_rcu_dyntick() statement, which narrows the
window of vulnerability to a pair of adjacent statements that are now
marked with comments to that effect.

Link: http://lkml.kernel.org/r/20170405093207.404f8deb@gandalf.local.home
Link: http://lkml.kernel.org/r/20170405193928.GM1600@linux.vnet.ibm.com

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# b17b0153 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/debug.h>

We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/debug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# ae7e81c0 01-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <uapi/linux/sched/types.h>

We are going to move scheduler ABI details to <uapi/linux/sched/types.h>,
which will be used from a number of .c files.

Create empty placeholder header that maps to <linux/types.h>.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# f9411ebe 06-Feb-2017 Ingo Molnar <mingo@kernel.org>

rcu: Separate the RCU synchronization types and APIs into <linux/rcupdate_wait.h>

So rcupdate.h is a pretty complex header, in particular it includes
<linux/completion.h> which includes <linux/wait.h> - creating a
dependency that includes <linux/wait.h> in <linux/sched.h>,
which prevents the isolation of <linux/sched.h> from the derived
<linux/wait.h> header.

Solve part of the problem by decoupling rcupdate.h from completions:
this can be done by separating out the rcu_synchronize types and APIs,
and updating their usage sites.

Since this is a mostly RCU-internal types this will not just simplify
<linux/sched.h>'s dependencies, but will make all the hundreds of
.c files that include rcupdate.h but not completions or wait.h build
faster.

( For rcutiny this means that two dependent APIs have to be uninlined,
but that shouldn't be much of a problem as they are rare variants. )

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 38d30b33 01-Dec-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Adjust FQS offline checks for exact online-CPU detection

Commit 7ec99de36f40 ("rcu: Provide exact CPU-online tracking for RCU"),
as its title suggests, got rid of RCU's remaining CPU-hotplug timing
guesswork. This commit therefore removes the one-jiffy kludge that was
used to paper over this guesswork.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 3a19b46a 30-Nov-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Check cond_resched_rcu_qs() state less often to reduce GP overhead

Commit 4a81e8328d37 ("rcu: Reduce overhead of cond_resched() checks
for RCU") moved quiescent-state generation out of cond_resched()
and commit bde6c3aa9930 ("rcu: Provide cond_resched_rcu_qs() to force
quiescent states in long loops") introduced cond_resched_rcu_qs(), and
commit 5cd37193ce85 ("rcu: Make cond_resched_rcu_qs() apply to normal RCU
flavors") introduced the per-CPU rcu_qs_ctr variable, which is frequently
polled by the RCU core state machine.

This frequent polling can increase grace-period rate, which in turn
increases grace-period overhead, which is visible in some benchmarks
(for example, the "open1" benchmark in Anton Blanchard's "will it scale"
suite). This commit therefore reduces the rate at which rcu_qs_ctr
is polled by moving that polling into the force-quiescent-state (FQS)
machinery, and by further polling it only after the grace period has
been in effect for at least jiffies_till_sched_qs jiffies.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 02a5c550 02-Nov-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Abstract extended quiescent state determination

This commit is the fourth step towards full abstraction of all accesses
to the ->dynticks counter, implementing previously open-coded checks and
comparisons in new rcu_dynticks_in_eqs() and rcu_dynticks_in_eqs_since()
functions. This abstraction will ease changes to the ->dynticks counter
operation.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 2625d469 02-Nov-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Abstract dynticks extended quiescent state enter/exit operations

This commit is the third step towards full abstraction of all accesses
to the ->dynticks counter, implementing the previously open-coded atomic
add of 1 and entry checks in a new rcu_dynticks_eqs_enter() function, and
the same but with exit checks in a new rcu_dynticks_eqs_exit() function.
This abstraction will ease changes to the ->dynticks counter operation.

Note that this commit gets rid of the smp_mb__before_atomic() and the
smp_mb__after_atomic() calls that were previously present. The reason
that this is OK from a memory-ordering perspective is that the atomic
operation is now atomic_add_return(), which, as a value-returning atomic,
guarantees full ordering.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fixed RCU_TRACE() statements added by this commit. ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# fdbb9b31 20-Dec-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_cpu_starting() use its "cpu" argument

The rcu_cpu_starting() function uses this_cpu_ptr() to locate the
incoming CPU's rcu_data structure. This works for the boot CPU and for
all CPUs onlined after rcu_init() executes (during very early boot).
Currently, this is the full set of CPUs, so all is well. But if
anyone ever parallelizes boot before rcu_init() time, it will fail.
This commit therefore substitutes the rcu_cpu_starting() function's
this_cpu_pointer() for per_cpu_ptr(), future-proofing the code and
(arguably) improving readability.

This commit inadvertently fixes a latent bug: If there ever had been
more than just the boot CPU online at rcu_init() time, the old code
would not initialize the non-boot CPUs, but rather would repeatedly
initialize the boot CPU.

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 630c7ed9 15-Dec-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Don't wake rcuc/X kthreads on NOCB CPUs

Chris Friesen notice that rcuc/X kthreads were consuming CPU even on
NOCB CPUs. This makes no sense because the only purpose or these
kthreads is to invoke normal (non-offloaded) callbacks, of which there
will never be any on NOCB CPUs. This problem was due to a bug in
cpu_has_callbacks_ready_to_invoke(), which should have been checking
->nxttail[RCU_NEXT_TAIL] for NULL, but which was instead (incorrectly)
checking ->nxttail[RCU_DONE_TAIL]. Because ->nxttail[RCU_DONE_TAIL] is
never NULL, the only effect is to cause the rcuc/X kthread to execute
when it should not do so.

This commit therefore checks ->nxttail[RCU_NEXT_TAIL], which is NULL
for NOCB CPUs.

Reported-by: Chris Friesen <chris.friesen@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 7aa92230 29-Nov-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Once again use NMI-based stack traces in stall warnings

This commit is for all intents and purposes a revert of bc1dce514e9b
("rcu: Don't use NMIs to dump other CPUs' stacks"). The reason to suppose
that this can now safely be reverted is the presence of 42a0bb3f7138
("printk/nmi: generic solution for safe printk in NMI"), which is said
to have made NMI-based stack dumps safe.

However, this reversion keeps one nice property of bc1dce514e9b
("rcu: Don't use NMIs to dump other CPUs' stacks"), namely that
only those CPUs blocking the grace period are dumped. The new
trigger_single_cpu_backtrace() is used to make this happen, as
suggested by Josh Poimboeuf.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# b201fa67 25-Nov-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove short-term CPU kicking

Commit 4914950aaa12d ("rcu: Stop treating in-kernel CPU-bound workloads
as errors") added a (relatively) short-timeout call to resched_cpu().
This was inspired by as issue that was fixed by b7e7ade34e61 ("sched/core:
Fix remote wakeups"). But given that this issue was fixed, it is time
for the current commit to remove this call to resched_cpu().

Reported-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 28053bc7 01-Dec-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Add long-term CPU kicking

This commit prepares for the removal of short-term CPU kicking (in a
subsequent commit). It does so by starting to invoke resched_cpu()
for each holdout at each force-quiescent-state interval that is more
than halfway through the stall-warning interval.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 94060d22 17-Oct-2016 Tobias Klauser <tklauser@distanz.ch>

rcu: Remove unused but set variable

Since commit 7ec99de36f40 ("rcu: Provide exact CPU-online tracking for
RCU"), the variable mask in rcu_init_percpu_data is set but no longer
used. Remove it to fix the following warning when building with 'W=1':

kernel/rcu/tree.c: In function ‘rcu_init_percpu_data’:
kernel/rcu/tree.c:3765:16: warning: variable ‘mask’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# c4402b27 09-Nov-2016 Byungchul Park <byungchul.park@lge.com>

rcu: Only dump stalled-tasks stacks if there was a real stall

The print_other_cpu_stall() function currently unconditionally invokes
rcu_print_detail_task_stall(). This is OK because if there was a stall
sufficient to cause print_other_cpu_stall() to be invoked, that stall
is very likely to persist through the entire print_other_cpu_stall()
execution. However, if the stall did not persist, the variable ndetected
will be zero, and that variable is already tested in an "if" statement.
Therefore, this commit moves the call to rcu_print_detail_task_stall()
under that pre-existing "if" to improve readability, with a very rare
reduction in overhead.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ paulmck: Reworked commit log. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 8b2f63ab 02-Nov-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Abstract the dynticks snapshot operation

This commit is the second step towards full abstraction of all accesses to
the ->dynticks counter, implementing the previously open-coded atomic
add of zero in a new rcu_dynticks_snap() function. This abstraction will
ease changes o the ->dynticks counter operation.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 6563de9d 02-Nov-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Abstract the dynticks momentary-idle operation

This commit is the first step towards full abstraction of all accesses to
the ->dynticks counter, implementing the previously open-coded atomic add
of two in a new rcu_dynticks_momentary_idle() function. This abstraction
will ease changes to the ->dynticks counter operation.

Note that this commit gets rid of the smp_mb__before_atomic() and the
smp_mb__after_atomic() calls that were previously present. The reason
that this is OK from a memory-ordering perspective is that the atomic
operation is now atomic_add_return(), which, as a value-returning atomic,
guarantees full ordering.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 52d7e48b 10-Jan-2017 Paul E. McKenney <paulmck@kernel.org>

rcu: Narrow early boot window of illegal synchronous grace periods

The current preemptible RCU implementation goes through three phases
during bootup. In the first phase, there is only one CPU that is running
with preemption disabled, so that a no-op is a synchronous grace period.
In the second mid-boot phase, the scheduler is running, but RCU has
not yet gotten its kthreads spawned (and, for expedited grace periods,
workqueues are not yet running. During this time, any attempt to do
a synchronous grace period will hang the system (or complain bitterly,
depending). In the third and final phase, RCU is fully operational and
everything works normally.

This has been OK for some time, but there has recently been some
synchronous grace periods showing up during the second mid-boot phase.
This code worked "by accident" for awhile, but started failing as soon
as expedited RCU grace periods switched over to workqueues in commit
8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue").
Note that the code was buggy even before this commit, as it was subject
to failure on real-time systems that forced all expedited grace periods
to run as normal grace periods (for example, using the rcu_normal ksysfs
parameter). The callchain from the failure case is as follows:

early_amd_iommu_init()
|-> acpi_put_table(ivrs_base);
|-> acpi_tb_put_table(table_desc);
|-> acpi_tb_invalidate_table(table_desc);
|-> acpi_tb_release_table(...)
|-> acpi_os_unmap_memory
|-> acpi_os_unmap_iomem
|-> acpi_os_map_cleanup
|-> synchronize_rcu_expedited

The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
which caused the code to try using workqueues before they were
initialized, which did not go well.

This commit therefore reworks RCU to permit synchronous grace periods
to proceed during this mid-boot phase. This commit is therefore a
fix to a regression introduced in v4.9, and is therefore being put
forward post-merge-window in v4.10.

This commit sets a flag from the existing rcu_scheduler_starting()
function which causes all synchronous grace periods to take the expedited
path. The expedited path now checks this flag, using the requesting task
to drive the expedited grace period forward during the mid-boot phase.
Finally, this flag is updated by a core_initcall() function named
rcu_exp_runtime_mode(), which causes the runtime codepaths to be used.

Note that this arrangement assumes that tasks are not sent POSIX signals
(or anything similar) from the time that the first task is spawned
through core_initcall() time.

Fixes: 8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue")
Reported-by: "Zheng, Lv" <lv.zheng@intel.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Stan Kain <stan.kain@gmail.com>
Tested-by: Ivan <waffolz@hotmail.com>
Tested-by: Emanuel Castelo <emanuel.castelo@gmail.com>
Tested-by: Bruno Pesavento <bpesavento@infinito.it>
Tested-by: Borislav Petkov <bp@suse.de>
Tested-by: Frederic Bezies <fredbezies@gmail.com>
Cc: <stable@vger.kernel.org> # 4.9.0-


# aa3e0bf1 23-Mar-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Don't kick unless grace period or request

The current code can result in spurious kicks when there are no grace
periods in progress and no grace-period-related requests. This is
sort of OK for a diagnostic aid, but the resulting ftrace-dump messages
in dmesg are annoying. This commit therefore avoids spurious kicks
in the common case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 1f21b50b 01-Sep-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove obsolete comment from __call_rcu()

The __call_rcu() comment about opportunistically noting grace period
beginnings and endings is obsolete. RCU still does such opportunistic
noting, but in __call_rcu_core() rather than __call_rcu(), and there
already is an appropriate comment in __call_rcu_core(). This commit
therefore removes the obsolete comment.

Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 5403d367 01-Sep-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove obsolete rcu_check_callbacks() header comment

In the deep past, rcu_check_callbacks() was only invoked if rcu_pending()
returned true. Which was fine, but these days rcu_check_callbacks()
is invoked unconditionally. This commit therefore removes the obsolete
sentence from the header comment.

Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# b8f2ed53 23-Aug-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Tighten up __call_rcu() rcu_head alignment check

Commit 720abae3d68ae ("rcu: force alignment on struct
callback_head/rcu_head") forced the rcu_head (AKA callback_head)
structure's alignment to pointer size, that is, to 4-byte boundaries on
32-bit systems and to 8-byte boundaries on 64-bit systems. This
commit therefore checks for this same alignment in __call_rcu(),
which used to contain a looser check for two-byte alignment.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 0766f788 20-Jun-2016 Emese Revfy <re.emese@gmail.com>

latent_entropy: Mark functions with __latent_entropy

The __latent_entropy gcc attribute can be used only on functions and
variables. If it is on a function then the plugin will instrument it for
gathering control-flow entropy. If the attribute is on a variable then
the plugin will initialize it with random contents. The variable must
be an integer, an integer array type or a structure with integer fields.

These specific functions have been selected because they are init
functions (to help gather boot-time entropy), are called at unpredictable
times, or they have variable loops, each of which provide some level of
latent entropy.

Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>


# 7ec99de3 30-Jun-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Provide exact CPU-online tracking for RCU

Up to now, RCU has assumed that the CPU-online process makes it from
CPU_UP_PREPARE to set_cpu_online() within one jiffy. Given the recent
rise of virtualized environments, this assumption is very clearly
obsolete. Failing to meet this deadline can result in RCU paying
attention to an incoming CPU for one jiffy, then ignoring it until the
grace period following the one in which that CPU sets itself online.
This situation might prove to be fatally disappointing to any RCU
read-side critical sections that had the misfortune to execute during
the time in which RCU was ignoring the slow-to-come-online CPU.

This commit therefore updates RCU's internal CPU state-tracking
information at notify_cpu_starting() time, thus providing RCU with
an exact transition of the CPU's state from offline to online.

Note that this means that incoming CPUs must not use RCU read-side
critical section (other than those of SRCU) until notify_cpu_starting()
time. Note also that the CPU_STARTING notifiers -are- allowed to use
RCU read-side critical sections. (Of course, CPU-hotplug notifiers are
rapidly becoming obsolete, so you need to act fast!)

If a given architecture or CPU family needs to use RCU read-side
critical sections earlier, the call to rcu_cpu_starting() from
notify_cpu_starting() will need to be architecture-specific, with
architectures that need early use being required to hand-place
the call to rcu_cpu_starting() at some point preceding the call to
notify_cpu_starting().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 3563a438 28-Jul-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Avoid redundant quiescent-state chasing

Currently, __note_gp_changes() checks to see if the CPU has slept through
multiple grace periods. If it has, it resynchronizes that CPU's view
of the grace-period state, which includes whether or not the current
grace period needs a quiescent state from this CPU. The fact of this
need (or lack thereof) needs to be in two places, rdp->cpu_no_qs.b.norm
and rdp->core_needs_qs. The former tells RCU's context-switch code to
go get a quiescent state and the latter says that it needs to be reported.
The current code unconditionally sets the former to true, but correctly
sets the latter.

This does not result in failures, but it does unnecessarily increase
the amount of work done on average at context-switch time. This commit
therefore correctly sets both fields.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# e77b7041 14-Jul-2016 Paul Gortmaker <paul.gortmaker@windriver.com>

rcu: Don't use modular infrastructure in non-modular code

The Kconfig currently controlling compilation of tree.c is:

init/Kconfig:config TREE_RCU
init/Kconfig: bool

...and update.c and sync.c are "obj-y" meaning that none are ever
built as a module by anyone.

Since MODULE_ALIAS is a no-op for non-modular code, we can remove
them from these files.

We leave moduleparam.h behind since the files instantiate some boot
time configuration parameters with module_param() still.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 94d44776 22-Jun-2016 Jisheng Zhang <jszhang@marvell.com>

rcu: Use rcu_gp_kthread_wake() to wake up grace period kthreads

Commit abedf8e2419f ("rcu: Use simple wait queues where possible in
rcutree") converts Tree RCU's wait queues to simple wait queues,
but it incorrectly reverts the commit 2aa792e6faf1 ("rcu: Use
rcu_gp_kthread_wake() to wake up grace period kthreads"). This can
result in redundant self-wakeups.

This commit therefore replaces the simple wait-queue wakeups with
rcu_gp_kthread_wake(), thus avoiding the redundant wakeups.

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 4df83742 13-Jul-2016 Thomas Gleixner <tglx@linutronix.de>

rcu: Convert rcutree to hotplug state machine

Straight forward conversion to the state machine. Though the question arises
whether this needs really all these state transitions to work.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153337.982013161@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# bc75e999 03-Jun-2016 Mark Rutland <mark.rutland@arm.com>

rcu: Correctly handle sparse possible cpus

In many cases in the RCU tree code, we iterate over the set of cpus for
a leaf node described by rcu_node::grplo and rcu_node::grphi, checking
per-cpu data for each cpu in this range. However, if the set of possible
cpus is sparse, some cpus described in this range are not possible, and
thus no per-cpu region will have been allocated (or initialised) for
them by the generic percpu code.

Erroneous accesses to a per-cpu area for these !possible cpus may fault
or may hit other data depending on the addressed generated when the
erroneous per cpu offset is applied. In practice, both cases have been
observed on arm64 hardware (the former being silent, but detectable with
additional patches).

To avoid issues resulting from this, we must iterate over the set of
*possible* cpus for a given leaf node. This patch add a new helper,
for_each_leaf_node_possible_cpu, to enable this. As iteration is often
intertwined with rcu_node local bitmask manipulation, a new
leaf_node_cpu_bit helper is added to make this simpler and more
consistent. The RCU tree code is made to use both of these where
appropriate.

Without this patch, running reboot at a shell can result in an oops
like:

[ 3369.075979] Unable to handle kernel paging request at virtual address ffffff8008b21b4c
[ 3369.083881] pgd = ffffffc3ecdda000
[ 3369.087270] [ffffff8008b21b4c] *pgd=00000083eca48003, *pud=00000083eca48003, *pmd=0000000000000000
[ 3369.096222] Internal error: Oops: 96000007 [#1] PREEMPT SMP
[ 3369.101781] Modules linked in:
[ 3369.104825] CPU: 2 PID: 1817 Comm: NetworkManager Tainted: G W 4.6.0+ #3
[ 3369.121239] task: ffffffc0fa13e000 ti: ffffffc3eb940000 task.ti: ffffffc3eb940000
[ 3369.128708] PC is at sync_rcu_exp_select_cpus+0x188/0x510
[ 3369.134094] LR is at sync_rcu_exp_select_cpus+0x104/0x510
[ 3369.139479] pc : [<ffffff80081109a8>] lr : [<ffffff8008110924>] pstate: 200001c5
[ 3369.146860] sp : ffffffc3eb9435a0
[ 3369.150162] x29: ffffffc3eb9435a0 x28: ffffff8008be4f88
[ 3369.155465] x27: ffffff8008b66c80 x26: ffffffc3eceb2600
[ 3369.160767] x25: 0000000000000001 x24: ffffff8008be4f88
[ 3369.166070] x23: ffffff8008b51c3c x22: ffffff8008b66c80
[ 3369.171371] x21: 0000000000000001 x20: ffffff8008b21b40
[ 3369.176673] x19: ffffff8008b66c80 x18: 0000000000000000
[ 3369.181975] x17: 0000007fa951a010 x16: ffffff80086a30f0
[ 3369.187278] x15: 0000007fa9505590 x14: 0000000000000000
[ 3369.192580] x13: ffffff8008b51000 x12: ffffffc3eb940000
[ 3369.197882] x11: 0000000000000006 x10: ffffff8008b51b78
[ 3369.203184] x9 : 0000000000000001 x8 : ffffff8008be4000
[ 3369.208486] x7 : ffffff8008b21b40 x6 : 0000000000001003
[ 3369.213788] x5 : 0000000000000000 x4 : ffffff8008b27280
[ 3369.219090] x3 : ffffff8008b21b4c x2 : 0000000000000001
[ 3369.224406] x1 : 0000000000000001 x0 : 0000000000000140
...
[ 3369.972257] [<ffffff80081109a8>] sync_rcu_exp_select_cpus+0x188/0x510
[ 3369.978685] [<ffffff80081128b4>] synchronize_rcu_expedited+0x64/0xa8
[ 3369.985026] [<ffffff80086b987c>] synchronize_net+0x24/0x30
[ 3369.990499] [<ffffff80086ddb54>] dev_deactivate_many+0x28c/0x298
[ 3369.996493] [<ffffff80086b6bb8>] __dev_close_many+0x60/0xd0
[ 3370.002052] [<ffffff80086b6d48>] __dev_close+0x28/0x40
[ 3370.007178] [<ffffff80086bf62c>] __dev_change_flags+0x8c/0x158
[ 3370.012999] [<ffffff80086bf718>] dev_change_flags+0x20/0x60
[ 3370.018558] [<ffffff80086cf7f0>] do_setlink+0x288/0x918
[ 3370.023771] [<ffffff80086d0798>] rtnl_newlink+0x398/0x6a8
[ 3370.029158] [<ffffff80086cee84>] rtnetlink_rcv_msg+0xe4/0x220
[ 3370.034891] [<ffffff80086e274c>] netlink_rcv_skb+0xc4/0xf8
[ 3370.040364] [<ffffff80086ced8c>] rtnetlink_rcv+0x2c/0x40
[ 3370.045663] [<ffffff80086e1fe8>] netlink_unicast+0x160/0x238
[ 3370.051309] [<ffffff80086e24b8>] netlink_sendmsg+0x2f0/0x358
[ 3370.056956] [<ffffff80086a0070>] sock_sendmsg+0x18/0x30
[ 3370.062168] [<ffffff80086a21cc>] ___sys_sendmsg+0x26c/0x280
[ 3370.067728] [<ffffff80086a30ac>] __sys_sendmsg+0x44/0x88
[ 3370.073027] [<ffffff80086a3100>] SyS_sendmsg+0x10/0x20
[ 3370.078153] [<ffffff8008085e70>] el0_svc_naked+0x24/0x28

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Dennis Chen <dennis.chen@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 088e9d25 02-Jun-2016 Daniel Bristot de Oliveira <bristot@redhat.com>

rcu: sysctl: Panic on RCU Stall

It is not always easy to determine the cause of an RCU stall just by
analysing the RCU stall messages, mainly when the problem is caused
by the indirect starvation of rcu threads. For example, when preempt_rcu
is not awakened due to the starvation of a timer softirq.

We have been hard coding panic() in the RCU stall functions for
some time while testing the kernel-rt. But this is not possible in
some scenarios, like when supporting customers.

This patch implements the sysctl kernel.panic_on_rcu_stall. If
set to 1, the system will panic() when an RCU stall takes place,
enabling the capture of a vmcore. The vmcore provides a way to analyze
all kernel/tasks states, helping out to point to the culprit and the
solution for the stall.

The kernel.panic_on_rcu_stall sysctl is disabled by default.

Changes from v1:
- Fixed a typo in the git log
- The if(sysctl_panic_on_rcu_stall) panic() is in a static function
- Fixed the CONFIG_TINY_RCU compilation issue
- The var sysctl_panic_on_rcu_stall is now __read_mostly

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Tested-by: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 3549c2bc 15-Apr-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Move expedited code from tree.c to tree_exp.h

People have been having some difficulty finding their way around the
RCU code. This commit therefore pulls some of the expedited grace-period
code from tree.c to a new tree_exp.h file. This commit is strictly code
movement, with the exception of a forward declaration that was added
for the sync_sched_exp_online_cleanup() function.

A subsequent commit will move the remaining expedited grace-period code
from tree_plugin.h to tree_exp.h.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d3acab65 10-Mar-2016 Peter Zijlstra <peterz@infradead.org>

rcu: Remove some superfluous lines

I think you'll find this condition is superfluous, as the whole function
is under #ifdef of that same.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 590d1757 10-Apr-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix outdated hotplug-exclusion comment in rcu_gp_init()

In the past, RCU grace-period initialization excluded CPU-hotplug
operations, but this is no longer the case. This commit therefore
removed an outdated comment in rcu_gp_init() claiming that these
are excluded.

Reported-by: Lihao Liang <lihao.liang@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 0d95092c 08-Apr-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix outdated rcu_scheduler_active comment

The comment header for rcu_scheduler_active states that it is used
to optimize synchronize_sched() at early boot. This is incorrect.
The synchronize_sched() function instead checks the number of online
CPUs. This commit therefore replaces the comment's synchronize_sched()
with synchronize_rcu(), which really does use rcu_scheduler_active for
this purpose.

Reported-by: Lihao Liang <lihao.liang@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 6428671b 01-Jun-2016 Peter Zijlstra <peterz@infradead.org>

locking/mutex: Optimize mutex_trylock() fast-path

A while back Viro posted a number of 'interesting' mutex_is_locked()
users on IRC, one of those was RCU.

RCU seems to use mutex_is_locked() to avoid doing mutex_trylock(), the
regular load before modify pattern.

While the use isn't wrong per se, its curious in that its needed at all,
mutex_trylock() should be good enough on its own to avoid the pointless
cacheline bounces.

So fix those and remove the mutex_is_locked() (ab)use from RCU.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hpe.com>
Link: http://lkml.kernel.org/r/20160601185815.GW3190@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 291783b8 12-Jan-2016 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Expedited-GP batch progress access to torturing

This commit provides rcu_exp_batches_completed() and
rcu_exp_batches_completed_sched() functions to allow torture-test modules
to check how many expedited grace period batches have completed.
These are analogous to the existing rcu_batches_completed(),
rcu_batches_completed_bh(), and rcu_batches_completed_sched() functions.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 5dffed1e 17-Feb-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Dump ftrace buffer when kicking grace-period kthread

If it is necessary to kick the grace-period kthread, that is a good
time to dump the trace buffer in order to learn why kicking was needed.
This commit therefore does the dump.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 8c7c4829 03-Jan-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Awaken grace-period kthread if too long since FQS

Recent kernels can fail to awaken the grace-period kthread for
quiescent-state forcing. This commit is a crude hack that does
a wakeup if a scheduling-clock interrupt sees that it has been
too long since force-quiescent-state (FQS) processing.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# fcfd0a23 03-Jan-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Make FQS schedule advance only if FQS happened

Currently, the force-quiescent-state (FQS) code in rcu_gp_kthread() can
advance the next FQS even if one was not executed last time. This can
happen due timeout-duration uncertainty. This commit therefore avoids
advancing the FQS schedule unless an FQS was just executed. In the
corner case where an FQS was not executed, but is due now, the code does
a one-jiffy wait.

This change prepares for kthread kicking.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 86057b80 31-Dec-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Awaken grace-period kthread when stalled

Recent kernels can fail to awaken the grace-period kthread for
quiescent-state forcing. This commit is a crude hack that does
a wakeup any time a stall is detected.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 3b5f668e 16-Mar-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Overlap wakeups with next expedited grace period

The current expedited grace-period implementation makes subsequent grace
periods wait on wakeups for the prior grace period. This does not fit
the dictionary definition of "expedited", so this commit allows these two
phases to overlap. Doing this requires four waitqueues rather than two
because tasks can now be waiting on the previous, current, and next grace
periods. The fourth waitqueue makes the bit masking work out nicely.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# aff12cdf 16-Mar-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Consolidate expedited GP code into exp_funnel_lock()

This commit pulls the grace-period-start counter adjustment and tracing
from synchronize_rcu_expedited() and synchronize_sched_expedited()
into exp_funnel_lock(), thus eliminating some code duplication.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 179e5dcd 16-Mar-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Consolidate expedited GP tracing into rcu_exp_gp_seq_snap()

This commit moves some duplicate code from synchronize_rcu_expedited()
and synchronize_sched_expedited() into rcu_exp_gp_seq_snap(). This
doesn't save lines of code, but does eliminate a "tell me twice" issue.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 4ea3e85b 16-Mar-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Consolidate expedited GP code into rcu_exp_wait_wake()

Currently, synchronize_rcu_expedited() and rcu_sched_expedited() have
significant duplicate code. This commit therefore consolidates some of
this code into rcu_exp_wake(), which is now renamed to rcu_exp_wait_wake()
in recognition of its added responsibilities.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 356051e1 16-Mar-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Add exp_funnel_lock() fastpath

This commit speeds up the low-contention case, especially for systems
with large rcu_node trees, by attempting to directly acquire the
->exp_mutex. This fastpath checks the leaves and root first in
order to avoid excessive memory contention on the mutex itself.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# f6a12f34 30-Jan-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Enforce expedited-GP fairness via funnel wait queue

The current mutex-based funnel-locking approach used by expedited grace
periods is subject to severe unfairness. The problem arises when a
few tasks, making a path from leaves to root, all wake up before other
tasks do. A new task can then follow this path all the way to the root,
which needlessly delays tasks whose grace period is done, but who do
not happen to acquire the lock quickly enough.

This commit avoids this problem by maintaining per-rcu_node wait queues,
along with a per-rcu_node counter that tracks the latest grace period
sought by an earlier task to visit this node. If that grace period
would satisfy the current task, instead of proceeding up the tree,
it waits on the current rcu_node structure using a pair of wait queues
provided for that purpose. This decouples awakening of old tasks from
the arrival of new tasks.

If the wakeups prove to be a bottleneck, additional kthreads can be
brought to bear for that purpose.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d40a4f09 08-Mar-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Shorten expedited_workdone* to exp_workdone*

Just a name change to save a few lines and a bit of typing.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# ec3833ed 11-Jan-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Force boolean subscript for expedited stall warnings

The cpu_online() function can return values other than 0 and 1, which
can result in subscript overflow when applied to a two-element array.
This commit allows for this behavior by using "!!" on the return value
from cpu_online() when used as a subscript.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# e2fd9d35 30-Jan-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove expedited GP funnel-lock bypass

Commit #cdacbe1f91264 ("rcu: Add fastpath bypassing funnel locking")
turns out to be a pessimization at high load because it forces a tree
full of tasks to wait for an expedited grace period that they probably
do not need. This commit therefore removes this optimization.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 4f415302 28-Jan-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Add expedited-grace-period event tracing

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# bea2de44 28-Jan-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Add funnel-locking tracing for expedited grace periods

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a1e12248 13-Jan-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Make cond_resched_rcu_qs() supply RCU-sched expedited QS

Although cond_resched_rcu_qs() supplies quiescent states to all flavors
of normal RCU grace periods, it does nothing for expedited RCU-sched
grace periods. This commit therefore adds a check for a need for a
quiescent state from the current CPU by an expedited RCU-sched grace
period, and invokes rcu_sched_qs() to supply that quiescent state if so.

Note that the check is racy in that we might be migrated to some other
CPU just after checking the per-CPU variable. This is OK because the
act of migration will do a context switch, which will supply the needed
quiescent state. The only downside is that we might do an unnecessary
call to rcu_sched_qs(), but the probability is low and the overhead
is small.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 251c617c 13-Jan-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Make expedited RCU-preempt stall warnings count accurately

Currently, synchronize_sched_expedited_wait() simply sets the ndetected
variable to the rcu_print_task_exp_stall() return value. This means
that if the last rcu_node structure has no stalled tasks, record of
any stalled tasks in previous rcu_node structures is lost, which can
in turn result in failure to dump out the blocking rcu_node structures.
Or could, had the test been correct.

This commit therefore adds the return value of rcu_print_task_exp_stall()
to ndetected and corrects the later test for ndetected.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 28728dd3 12-Jan-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Make expedited RCU-sched grace period immediately detect idle

Currently, sync_sched_exp_handler() will force a reschedule unless
this CPU has already checked in or unless a reschedule has already
been called for. This is clearly wasteful if sync_sched_exp_handler()
interrupted an idle CPU, so this commit immediately reports the
quiescent state in that case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 274529ba 21-Mar-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Consolidate dumping of ftrace buffer

This commit consolidates a couple definitions and several calls for
single-shot ftrace-buffer dumping.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 27d50c7e 26-Feb-2016 Thomas Gleixner <tglx@linutronix.de>

rcu: Make CPU_DYING_IDLE an explicit call

Make the RCU CPU_DYING_IDLE callback an explicit function call, so it gets
invoked at the proper place.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.870167933@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# abedf8e2 19-Feb-2016 Paul Gortmaker <paul.gortmaker@windriver.com>

rcu: Use simple wait queues where possible in rcutree

As of commit dae6e64d2bcfd ("rcu: Introduce proper blocking to no-CBs kthreads
GP waits") the RCU subsystem started making use of wait queues.

Here we convert all additions of RCU wait queues to use simple wait queues,
since they don't need the extra overhead of the full wait queue features.

Originally this was done for RT kernels[1], since we would get things like...

BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
in_atomic(): 1, irqs_disabled(): 1, pid: 8, name: rcu_preempt
Pid: 8, comm: rcu_preempt Not tainted
Call Trace:
[<ffffffff8106c8d0>] __might_sleep+0xd0/0xf0
[<ffffffff817d77b4>] rt_spin_lock+0x24/0x50
[<ffffffff8106fcf6>] __wake_up+0x36/0x70
[<ffffffff810c4542>] rcu_gp_kthread+0x4d2/0x680
[<ffffffff8105f910>] ? __init_waitqueue_head+0x50/0x50
[<ffffffff810c4070>] ? rcu_gp_fqs+0x80/0x80
[<ffffffff8105eabb>] kthread+0xdb/0xe0
[<ffffffff8106b912>] ? finish_task_switch+0x52/0x100
[<ffffffff817e0754>] kernel_thread_helper+0x4/0x10
[<ffffffff8105e9e0>] ? __init_kthread_worker+0x60/0x60
[<ffffffff817e0750>] ? gs_change+0xb/0xb

...and hence simple wait queues were deployed on RT out of necessity
(as simple wait uses a raw lock), but mainline might as well take
advantage of the more streamline support as well.

[1] This is a carry forward of work from v3.10-rt; the original conversion
was by Thomas on an earlier -rt version, and Sebastian extended it to
additional post-3.10 added RCU waiters; here I've added a commit log and
unified the RCU changes into one, and uprev'd it to match mainline RCU.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-rt-users@vger.kernel.org
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1455871601-27484-6-git-send-email-wagi@monom.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 065bb78c 19-Feb-2016 Daniel Wagner <daniel.wagner@bmw-carit.de>

rcu: Do not call rcu_nocb_gp_cleanup() while holding rnp->lock

rcu_nocb_gp_cleanup() is called while holding rnp->lock. Currently,
this is okay because the wake_up_all() in rcu_nocb_gp_cleanup() will
not enable the IRQs. lockdep is happy.

By switching over using swait this is not true anymore. swake_up_all()
enables the IRQs while processing the waiters. __do_softirq() can now
run and will eventually call rcu_process_callbacks() which wants to
grap nrp->lock.

Let's move the rcu_nocb_gp_cleanup() call outside the lock before we
switch over to swait.

If we would hold the rnp->lock and use swait, lockdep reports
following:

=================================
[ INFO: inconsistent lock state ]
4.2.0-rc5-00025-g9a73ba0 #136 Not tainted
---------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
rcu_preempt/8 [HC0[0]:SC0[0]:HE1:SE1] takes:
(rcu_node_1){+.?...}, at: [<ffffffff811387c7>] rcu_gp_kthread+0xb97/0xeb0
{IN-SOFTIRQ-W} state was registered at:
[<ffffffff81109b9f>] __lock_acquire+0xd5f/0x21e0
[<ffffffff8110be0f>] lock_acquire+0xdf/0x2b0
[<ffffffff81841cc9>] _raw_spin_lock_irqsave+0x59/0xa0
[<ffffffff81136991>] rcu_process_callbacks+0x141/0x3c0
[<ffffffff810b1a9d>] __do_softirq+0x14d/0x670
[<ffffffff810b2214>] irq_exit+0x104/0x110
[<ffffffff81844e96>] smp_apic_timer_interrupt+0x46/0x60
[<ffffffff81842e70>] apic_timer_interrupt+0x70/0x80
[<ffffffff810dba66>] rq_attach_root+0xa6/0x100
[<ffffffff810dbc2d>] cpu_attach_domain+0x16d/0x650
[<ffffffff810e4b42>] build_sched_domains+0x942/0xb00
[<ffffffff821777c2>] sched_init_smp+0x509/0x5c1
[<ffffffff821551e3>] kernel_init_freeable+0x172/0x28f
[<ffffffff8182cdce>] kernel_init+0xe/0xe0
[<ffffffff8184231f>] ret_from_fork+0x3f/0x70
irq event stamp: 76
hardirqs last enabled at (75): [<ffffffff81841330>] _raw_spin_unlock_irq+0x30/0x60
hardirqs last disabled at (76): [<ffffffff8184116f>] _raw_spin_lock_irq+0x1f/0x90
softirqs last enabled at (0): [<ffffffff810a8df2>] copy_process.part.26+0x602/0x1cf0
softirqs last disabled at (0): [< (null)>] (null)
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(rcu_node_1);
<Interrupt>
lock(rcu_node_1);
*** DEADLOCK ***
1 lock held by rcu_preempt/8:
#0: (rcu_node_1){+.?...}, at: [<ffffffff811387c7>] rcu_gp_kthread+0xb97/0xeb0
stack backtrace:
CPU: 0 PID: 8 Comm: rcu_preempt Not tainted 4.2.0-rc5-00025-g9a73ba0 #136
Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 01/16/2014
0000000000000000 000000006d7e67d8 ffff881fb081fbd8 ffffffff818379e0
0000000000000000 ffff881fb0812a00 ffff881fb081fc38 ffffffff8110813b
0000000000000000 0000000000000001 ffff881f00000001 ffffffff8102fa4f
Call Trace:
[<ffffffff818379e0>] dump_stack+0x4f/0x7b
[<ffffffff8110813b>] print_usage_bug+0x1db/0x1e0
[<ffffffff8102fa4f>] ? save_stack_trace+0x2f/0x50
[<ffffffff811087ad>] mark_lock+0x66d/0x6e0
[<ffffffff81107790>] ? check_usage_forwards+0x150/0x150
[<ffffffff81108898>] mark_held_locks+0x78/0xa0
[<ffffffff81841330>] ? _raw_spin_unlock_irq+0x30/0x60
[<ffffffff81108a28>] trace_hardirqs_on_caller+0x168/0x220
[<ffffffff81108aed>] trace_hardirqs_on+0xd/0x10
[<ffffffff81841330>] _raw_spin_unlock_irq+0x30/0x60
[<ffffffff810fd1c7>] swake_up_all+0xb7/0xe0
[<ffffffff811386e1>] rcu_gp_kthread+0xab1/0xeb0
[<ffffffff811089bf>] ? trace_hardirqs_on_caller+0xff/0x220
[<ffffffff81841341>] ? _raw_spin_unlock_irq+0x41/0x60
[<ffffffff81137c30>] ? rcu_barrier+0x20/0x20
[<ffffffff810d2014>] kthread+0x104/0x120
[<ffffffff81841330>] ? _raw_spin_unlock_irq+0x30/0x60
[<ffffffff810d1f10>] ? kthread_create_on_node+0x260/0x260
[<ffffffff8184231f>] ret_from_fork+0x3f/0x70
[<ffffffff810d1f10>] ? kthread_create_on_node+0x260/0x260

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-rt-users@vger.kernel.org
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1455871601-27484-5-git-send-email-wagi@monom.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 4b455dc3 27-Jan-2016 Paul E. McKenney <paulmck@kernel.org>

rcu: Catch up rcu_report_qs_rdp() comment with reality

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 67c583a7 28-Dec-2015 Boqun Feng <boqun.feng@gmail.com>

RCU: Privatize rcu_node::lock

In patch:

"rcu: Add transitivity to remaining rcu_node ->lock acquisitions"

All locking operations on rcu_node::lock are replaced with the wrappers
because of the need of transitivity, which indicates we should never
write code using LOCK primitives alone(i.e. without a proper barrier
following) on rcu_node::lock outside those wrappers. We could detect
this kind of misuses on rcu_node::lock in the future by adding __private
modifier on rcu_node::lock.

To privatize rcu_node::lock, unlock wrappers are also needed. Replacing
spinlock unlocks with these wrappers not only privatizes rcu_node::lock
but also makes it easier to figure out critical sections of rcu_node.

This patch adds __private modifier to rcu_node::lock and makes every
access to it wrapped by ACCESS_PRIVATE(). Besides, unlock wrappers are
added and raw_spin_unlock(&rnp->lock) and its friends are replaced with
those wrappers.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 1914aab5 26-Dec-2015 Chen Gang <chengang@emindsoft.com.cn>

rcu: Remove useless rcu_data_p when !PREEMPT_RCU

The related warning from gcc 6.0:

In file included from kernel/rcu/tree.c:4630:0:
kernel/rcu/tree_plugin.h:810:40: warning: ‘rcu_data_p’ defined but not used [-Wunused-const-variable]
static struct rcu_data __percpu *const rcu_data_p = &rcu_sched_data;
^~~~~~~~~~

Also remove always redundant rcu_data_p in tree.c.

Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 23a9bacd 13-Dec-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Set rdp->gpwrap when CPU is idle

Commit #e3663b1024d1 ("rcu: Handle gpnum/completed wrap while dyntick
idle") sets rdp->gpwrap on the wrong side of the "if" statement in
dyntick_save_progress_counter(), that is, it sets it when the CPU is
not idle instead of when it is idle. Of course, if the CPU is not idle,
its rdp->gpnum won't be lagging beind the global rsp->gpnum, which means
that rdp->gpwrap will never be set.

This commit therefore moves this code to the proper leg of that "if"
statement. This change means that the "else" cause is just "return 0"
and the "then" clause ends with "return 1", so also move the "return 0"
to follow the "if", dropping the "else" clause.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 4914950a 11-Dec-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Stop treating in-kernel CPU-bound workloads as errors

Commit 4a81e8328d379 ("Reduce overhead of cond_resched() checks for RCU")
handles the error case where a nohz_full loops indefinitely in the kernel
with the scheduling-clock interrupt disabled. However, this handling
includes IPIing the CPU running the offending loop, which is not what
we want for real-time workloads. And there are starting to be real-time
CPU-bound in-kernel workloads, and these must be handled without IPIing
the CPU, at least not in the common case. Therefore, this situation can
no longer be dismissed as an error case.

This commit therefore splits the handling out, so that the setting of
bits in the per-CPU rcu_sched_qs_mask variable is done relatively early,
but if the problem persists, resched_cpu() is eventually used to IPI the
CPU containing the offending loop. Assuming that in-kernel CPU-bound
loops used by real-time tasks contain frequent calls cond_resched_rcu_qs()
(as in more than once per few tens of milliseconds), the real-time tasks
will never be IPIed.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>


# 8994515c 10-Dec-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Update rcu_report_qs_rsp() comment

The header comment for rcu_report_qs_rsp() was obsolete, dating well
before the advent of RCU grace-period kthreads. This commit therefore
brings this comment back into alignment with current reality.

Reported-by: Lihao Liang <lihao.liang@cs.ox.ac.uk>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# bb53e416 10-Dec-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Assign false instead of 0 for ->core_needs_qs

A zero seems to have escaped earlier true/false substitution efforts,
so this commit changes 0 to false for the ->core_needs_qs boolean field.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 45fed3e7 08-Nov-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_gp_init() be bool rather than int

The return value from rcu_gp_init() is always used as a bool, so
this commit makes it be a bool.

Reported-by: Iftekhar Ahmed <ahmedi@oregonstate.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# e11f1335 04-Nov-2015 Peter Zijlstra <peterz@infradead.org>

rcu: Move wakeup out from under rnp->lock

This patch removes a potential deadlock hazard by moving the
wake_up_process() in rcu_spawn_gp_kthread() out from under rnp->lock.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 7c9906ca 31-Oct-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Don't redundantly disable irqs in rcu_irq_{enter,exit}()

This commit replaces a local_irq_save()/local_irq_restore() pair with
a lockdep assertion that interrupts are already disabled. This should
remove the corresponding overhead from the interrupt entry/exit fastpaths.

This change was inspired by the fact that Iftekhar Ahmed's mutation
testing showed that removing rcu_irq_enter()'s call to local_ird_restore()
had no effect, which might indicate that interrupts were always enabled
anyway.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d117c8aa 31-Oct-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Make cpu_needs_another_gp() be bool

The cpu_needs_another_gp() function is currently of type int, but only
returns zero or one. Bow to reality and make it be of type bool.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a87f203e 20-Oct-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate unused rcu_init_one() argument

Now that the rcu_state structure's ->rda field is compile-time initialized,
there is no need to pass the per-CPU rcu_data structure into rcu_init_one().
This commit therefore eliminates this now-unused parameter.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 6b50e119 17-Nov-2015 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Print symbolic name for ->gp_state

Currently, ->gp_state is printed as an integer, which slows debugging.
This commit therefore prints a symbolic name in addition to the integer.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Updated to fix relational operator called out by Dan Carpenter. ]
[ paulmck: More "const", as suggested by Josh Triplett. ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# b1adb3e2 01-Oct-2015 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Dump stack when GP kthread stalls

This commit increases debug information in the case where the grace-period
kthread is being prevented from running by dumping that kthread's stack.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Split into prior commit and this commit, as suggested by
Josh Triplett. ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# a0e3a3aa 05-Dec-2015 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Flag nonexistent RCU GP kthread

Currently, if the RCU grace-period kthread has not yet been created,
in which case the starvation-check code will print zero for the state,
which maps to TASK_RUNNING. This could clearly be quite confusing, so
this commit prints ~0, which does not map to any legal ->state value.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 46a5d164 07-Oct-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Stop disabling interrupts in scheduler fastpaths

We need the scheduler's fastpaths to be, well, fast, and unnecessarily
disabling and re-enabling interrupts is not necessarily consistent with
this goal. Especially given that there are regions of the scheduler that
already have interrupts disabled.

This commit therefore moves the call to rcu_note_context_switch()
to one of the interrupts-disabled regions of the scheduler, and
removes the now-redundant disabling and re-enabling of interrupts from
rcu_note_context_switch() and the functions it calls.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Shift rcu_note_context_switch() to avoid deadlock, as suggested
by Peter Zijlstra. ]


# fecbf6f0 28-Sep-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Simplify rcu_sched_qs() control flow

This commit applies an early-exit approach to rcu_sched_qs(), reducing
the nesting level and saving a line of code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 3dc5dbe9 26-Sep-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Move lock_class_key to local scope

Currently, the rcu_node_class[], rcu_fqs_class[], and rcu_exp_class[]
arrays needlessly pollute the global namespace within tree.c. This
commit therefore converts them to static local variables within
rcu_init_one().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 5a9be7c6 24-Nov-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Add rcu_normal kernel parameter to suppress expediting

Although expedited grace periods can be quite useful, and although their
OS jitter has been greatly reduced, they can still pose problems for
extreme real-time workloads. This commit therefore adds a rcu_normal
kernel boot parameter (which can also be manipulated via sysfs)
to suppress expedited grace periods, that is, to treat requests for
expedited grace periods as if they were requests for normal grace periods.
If both rcu_expedited and rcu_normal are specified, rcu_normal wins.
This means that if you are relying on expedited grace periods to speed up
boot, you will want to specify rcu_expedited on the kernel command line,
and then specify rcu_normal via sysfs once boot completes.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 72611ab9 17-Nov-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Add more diagnostics to expedited stall warning messages.

This commit adds print statements that check the rcu_node structure to
find which ->expmask bits and which ->exp_tasks structures are blocking
the current expedited grace period.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 73f36f9d 17-Nov-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Make expedited grace periods resolve stall-warning ties

Currently, if a grace period ends just as the stall-warning timeout
fires, an empty stall warning will be printed. This is not helpful,
so this commit avoids these useless warnings by rechecking completion
after awakening in synchronize_sched_expedited_wait().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# df5bd514 01-Oct-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Reduce expedited GP memory contention via per-CPU variables

Currently, the piggybacked-work checks carried out by sync_exp_work_done()
atomically increment a small set of variables (the ->expedited_workdone0,
->expedited_workdone1, ->expedited_workdone2, ->expedited_workdone3
fields in the rcu_state structure), which will form a memory-contention
bottleneck given a sufficiently large number of CPUs concurrently invoking
either synchronize_rcu_expedited() or synchronize_sched_expedited().

This commit therefore moves these for fields to the per-CPU rcu_data
structure, eliminating the memory contention. The show_rcuexp() function
also changes to sum up each field in the rcu_data structures.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 1307f214 29-Sep-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Invert sync_rcu_exp_select_cpus() "if" statement

This commit saves a couple lines of code and reduces indentation
by inverting the sense of an "if" statement in the function
sync_rcu_exp_select_cpus().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 886ef5a1 29-Sep-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Move smp_mb() from rcu_seq_snap() to rcu_exp_gp_seq_snap()

The memory barrier in rcu_seq_snap() is needed only for grace periods,
so this commit moves it to the grace-period-oriented wrapper
rcu_exp_gp_seq_snap().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 06f60de1 29-Sep-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Short-circuit synchronize_sched_expedited() if only one CPU

If there is only one CPU, then invoking synchronize_sched_expedited()
is by definition a grace period. This commit checks for this condition
and does a short-circuit return in that case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 6cf10081 08-Oct-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Add transitivity to remaining rcu_node ->lock acquisitions

The rule is that all acquisitions of the rcu_node structure's ->lock
must provide transitivity: The lock is not acquired that frequently,
and sorting out exactly which required it and which did not would be
a maintenance nightmare. This commit therefore supplies the needed
transitivity to the remaining ->lock acquisitions.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 2a67e741 07-Oct-2015 Peter Zijlstra <peterz@infradead.org>

rcu: Create transitive rnp->lock acquisition functions

Providing RCU's memory-ordering guarantees requires that the rcu_node
tree's locking provide transitive memory ordering, which the Linux kernel's
spinlocks currently do not provide unless smp_mb__after_unlock_lock()
is used. Having a separate smp_mb__after_unlock_lock() after each and
every lock acquisition is error-prone, hard to read, and a bit annoying,
so this commit provides wrapper functions that pull in the
smp_mb__after_unlock_lock() invocations.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 338b0f76 03-Sep-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Better hotplug handling for synchronize_sched_expedited()

Earlier versions of synchronize_sched_expedited() can prematurely end
grace periods due to the fact that a CPU marked as cpu_is_offline()
can still be using RCU read-side critical sections during the time that
CPU makes its last pass through the scheduler and into the idle loop
and during the time that a given CPU is in the process of coming online.
This commit therefore eliminates this window by adding additional
interaction with the CPU-hotplug operations.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# c5865638 18-Aug-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Add tasks to expedited stall-warning messages

This commit adds task-print ability to the expedited RCU CPU stall
warning messages in preparation for adding stall warnings to
synchornize_rcu_expedited().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 74611ecb 18-Aug-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Add online/offline info to expedited stall warning message

This commit makes the RCU CPU stall warning message print online/offline
indications immediately after the CPU number. A "O" indicates global
offline, a "." global online, and a "o" indicates RCU believes that the
CPU is offline for the current grace period and "." otherwise, and an
"N" indicates that RCU believes that the CPU will be offline for the
next grace period, and "." otherwise, all right after the CPU number.
So for CPU 10, you would normally see "10-...:" indicating that everything
believes that the CPU is online.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# dcdb8807 15-Aug-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Consolidate expedited CPU selection

Now that sync_sched_exp_select_cpus() and sync_rcu_exp_select_cpus()
are identical aside from the the argument to smp_call_function_single(),
this commit consolidates them with a functional argument.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 66fe6cbe 15-Aug-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Prepare for consolidating expedited CPU selection

This commit brings sync_sched_exp_select_cpus() into alignment with
sync_rcu_exp_select_cpus(), as a first step towards consolidating them
into one function.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 807226e2 07-Aug-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Stop excluding CPU hotplug in synchronize_sched_expedited()

Now that synchronize_sched_expedited() uses IPIs, a hook in
rcu_sched_qs(), and the ->expmask field in the rcu_node combining
tree, it is no longer necessary to exclude CPU hotplug. Any
races with CPU hotplug will be detected when attempting to send
the IPI. This commit therefore removes the code excluding
CPU hotplug operations.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 83c2c735 06-Aug-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Stop silencing lockdep false positive for expedited grace periods

This reverts commit af859beaaba4 (rcu: Silence lockdep false positive
for expedited grace periods). Because synchronize_rcu_expedited()
no longer invokes synchronize_sched_expedited(), ->exp_funnel_mutex
acquisition is no longer nested, so the false positive no longer happens.
This commit therefore removes the extra lockdep data structures, as they
are no longer needed.


# 6587a23b 06-Aug-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Switch synchronize_sched_expedited() to IPI

This commit switches synchronize_sched_expedited() from stop_one_cpu_nowait()
to smp_call_function_single(), thus moving from an IPI and a pair of
context switches to an IPI and a single pass through the scheduler.
Of course, if the scheduler actually does decide to switch to a different
task, there will still be a pair of context switches, but there would
likely have been a pair of context switches anyway, just a bit later.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 77f81fe0 09-Sep-2015 Petr Mladek <pmladek@suse.com>

rcu: Finish folding ->fqs_state into ->gp_state

Commit commit 4cdfc175c25c89ee ("rcu: Move quiescent-state forcing
into kthread") started the process of folding the old ->fqs_state into
->gp_state, but did not complete it. This situation does not cause
any malfunction, but can result in extremely confusing trace output.
This commit completes this task of eliminating ->fqs_state in favor
of ->gp_state.

The old ->fqs_state was also used to decide when to collect dyntick-idle
snapshots. For this purpose, we add a boolean variable into the kthread,
which is set on the first call to rcu_gp_fqs() for a given grace period
and clear otherwise.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# ee968ac6 31-Jul-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate panic when silly boot-time fanout specified

This commit loosens rcutree.rcu_fanout_leaf range checks
and replaces a panic() with a fallback to compile-time values.
This fallback is accompanied by a WARN_ON(), and both occur when the
rcutree.rcu_fanout_leaf value is too small to accommodate the number of
CPUs. For example, given the current four-level limit for the rcu_node
tree, a system with more than 16 CPUs built with CONFIG_FANOUT=2 must
have rcutree.rcu_fanout_leaf larger than 2.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# bb73c52b 30-Jul-2015 Boqun Feng <boqun.feng@gmail.com>

rcu: Don't disable preemption for Tiny and Tree RCU readers

Because preempt_disable() maps to barrier() for non-debug builds,
it forces the compiler to spill and reload registers. Because Tree
RCU and Tiny RCU now only appear in CONFIG_PREEMPT=n builds, these
barrier() instances generate needless extra code for each instance of
rcu_read_lock() and rcu_read_unlock(). This extra code slows down Tree
RCU and bloats Tiny RCU.

This commit therefore removes the preempt_disable() and preempt_enable()
from the non-preemptible implementations of __rcu_read_lock() and
__rcu_read_unlock(), respectively. However, for debug purposes,
preempt_disable() and preempt_enable() are still invoked if
CONFIG_PREEMPT_COUNT=y, because this allows detection of sleeping inside
atomic sections in non-preemptible kernels.

However, Tiny and Tree RCU operates by coalescing all RCU read-side
critical sections on a given CPU that lie between successive quiescent
states. It is therefore necessary to compensate for removing barriers
from __rcu_read_lock() and __rcu_read_unlock() by adding them to a
couple of the RCU functions invoked during quiescent states, namely to
rcu_all_qs() and rcu_note_context_switch(). However, note that the latter
is more paranoia than necessity, at least until link-time optimizations
become more aggressive.

This is based on an earlier patch by Paul E. McKenney, fixing
a bug encountered in kernels built with CONFIG_PREEMPT=n and
CONFIG_PREEMPT_COUNT=y.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b6a4ae76 28-Jul-2015 Boqun Feng <boqun.feng@gmail.com>

rcu: Use rcu_callback_t in call_rcu*() and friends

As we now have rcu_callback_t typedefs as the type of rcu callbacks, we
should use it in call_rcu*() and friends as the type of parameters. This
could save us a few lines of code and make it clear which function
requires an rcu callbacks rather than other callbacks as its argument.

Besides, this can also help cscope to generate a better database for
code reading.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 5b74c458 06-Aug-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Make ->cpu_no_qs be a union for aggregate OR

This commit converts the rcu_data structure's ->cpu_no_qs field
to a union. The bytewise side of this union allows individual access
to indications as to whether this CPU needs to find a quiescent state
for a normal (.norm) and/or expedited (.exp) grace period. The setwise
side of the union allows testing whether or not a quiescent state is
needed at all, for either type of grace period.

For now, only .norm is used. A later commit will introduce the expedited
usage.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 0d43eb34 06-Aug-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Invert passed_quiesce and rename to cpu_no_qs

This commit inverts the sense of the rcu_data structure's ->passed_quiesce
field and renames it to ->cpu_no_qs. This will allow a later commit to
use an "aggregate OR" operation to test expedited as well as normal grace
periods without added overhead.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 97c668b8 06-Aug-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Rename qs_pending to core_needs_qs

An upcoming commit needs to invert the sense of the ->passed_quiesce
rcu_data structure field, so this commit is taking this opportunity
to clarify things a bit by renaming ->qs_pending to ->core_needs_qs.

So if !rdp->core_needs_qs, then this CPU need not concern itself with
quiescent states, in particular, it need not acquire its leaf rcu_node
structure's ->lock to check. Otherwise, it needs to report the next
quiescent state.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# bce5fa12 05-Aug-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Move synchronize_sched_expedited() to combining tree

Currently, synchronize_sched_expedited() uses a single global counter
to track the number of remaining context switches that the current
expedited grace period must wait on. This is problematic on large
systems, where the resulting memory contention can be pathological.
This commit therefore makes synchronize_sched_expedited() instead use
the combining tree in the same manner as synchronize_rcu_expedited(),
keeping memory contention down to a dull roar.

This commit creates a temporary function sync_sched_exp_select_cpus()
that is very similar to sync_rcu_exp_select_cpus(). A later commit
will consolidate these two functions, which becomes possible when
synchronize_sched_expedited() switches from stop_one_cpu_nowait() to
smp_call_function_single().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 8203d6d0 02-Aug-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Use single-stage IPI algorithm for RCU expedited grace period

The current preemptible-RCU expedited grace-period algorithm invokes
synchronize_sched_expedited() to enqueue all tasks currently running
in a preemptible-RCU read-side critical section, then waits for all the
->blkd_tasks lists to drain. This works, but results in both an IPI and
a double context switch even on CPUs that do not happen to be running
in a preemptible RCU read-side critical section.

This commit implements a new algorithm that causes less OS jitter.
This new algorithm IPIs all online CPUs that are not idle (from an
RCU perspective), but refrains from self-IPIs. If a CPU receiving
this IPI is not in a preemptible RCU read-side critical section (or
is just now exiting one), it pushes quiescence up the rcu_node tree,
otherwise, it sets a flag that will be handled by the upcoming outermost
rcu_read_unlock(), which will then push quiescence up the tree.

The expedited grace period must of course wait on any pre-existing blocked
readers, and newly blocked readers must be queued carefully based on
the state of both the normal and the expedited grace periods. This
new queueing approach also avoids the need to update boost state,
courtesy of the fact that blocked tasks are no longer ever migrated to
the root rcu_node structure.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b9585e94 31-Jul-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Consolidate tree setup for synchronize_rcu_expedited()

This commit replaces sync_rcu_preempt_exp_init1(() and
sync_rcu_preempt_exp_init2() with sync_exp_reset_tree_hotplug()
and sync_exp_reset_tree(), which will also be used by
synchronize_sched_expedited(), and sync_rcu_exp_select_nodes(), which
contains code specific to synchronize_rcu_expedited().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 7922cd0e 31-Jul-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Move rcu_report_exp_rnp() to allow consolidation

This is a nearly pure code-movement commit, moving rcu_report_exp_rnp(),
sync_rcu_preempt_exp_done(), and rcu_preempted_readers_exp() so
that later commits can make synchronize_sched_expedited() use them.
The non-code-movement portion of this commit tags rcu_report_exp_rnp()
as __maybe_unused to avoid build errors when CONFIG_PREEMPT=n.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# f4ecea30 29-Jul-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Use rsp->expedited_wq instead of sync_rcu_preempt_exp_wq

Now that there is an ->expedited_wq waitqueue in each rcu_state structure,
there is no need for the sync_rcu_preempt_exp_wq global variable. This
commit therefore substitutes ->expedited_wq for sync_rcu_preempt_exp_wq.
It also initializes ->expedited_wq only once at boot instead of at the
start of each expedited grace period.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 19a5ecde 20-Sep-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Suppress lockdep false positive for rcp->exp_funnel_mutex

In kernels built with CONFIG_PREEMPT=y, synchronize_rcu_expedited()
invokes synchronize_sched_expedited() while holding RCU-preempt's
root rcu_node structure's ->exp_funnel_mutex, which is acquired after
the rcu_data structure's ->exp_funnel_mutex. The first thing that
synchronize_sched_expedited() will do is acquire RCU-sched's rcu_data
structure's ->exp_funnel_mutex. There is no danger of an actual deadlock
because the locking order is always from RCU-preempt's expedited mutexes
to those of RCU-sched. Unfortunately, lockdep considers both rcu_data
structures' ->exp_funnel_mutex to be in the same lock class and therefore
reports a deadlock cycle.

This commit silences this false positive by placing RCU-sched's rcu_data
structures' ->exp_funnel_mutex locks into their own lock class.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# af859bea 19-Jul-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Silence lockdep false positive for expedited grace periods

In a CONFIG_PREEMPT=y kernel, synchronize_rcu_expedited()
acquires the ->exp_funnel_mutex in rcu_preempt_state, then invokes
synchronize_sched_expedited, which acquires the ->exp_funnel_mutex in
rcu_sched_state. There can be no deadlock because rcu_preempt_state
->exp_funnel_mutex acquisition always precedes that of rcu_sched_state.
But lockdep does not know that, so it gives false-positive splats.

This commit therefore associates a separate lock_class_key structure
with the rcu_sched_state structure's ->exp_funnel_mutex, allowing
lockdep to see the lock ordering, avoiding the false positives.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# f78f5b90 18-Jun-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()

This commit renames rcu_lockdep_assert() to RCU_LOCKDEP_WARN() for
consistency with the WARN() series of macros. This also requires
inverting the sense of the conditional, which this commit also does.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>


# 46f00d18 16-Jun-2015 Alexei Starovoitov <ast@kernel.org>

rcu: Make rcu_is_watching() really notrace

Although rcu_is_watching() is marked notrace, it invokes preempt_disable()
and preempt_enable(), both of which can be traced. This defeats the
purpose of the notrace on rcu_is_watching(), so this commit substitutes
preempt_disable_notrace() and preempt_enable_notrace().

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>


# 24560056 30-May-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Add RCU-sched flavors of get-state and cond-sync

The get_state_synchronize_rcu() and cond_synchronize_rcu() functions
allow polling for grace-period completion, with an actual wait for a
grace period occurring only when cond_synchronize_rcu() is called too
soon after the corresponding get_state_synchronize_rcu(). However,
these functions work only for vanilla RCU. This commit adds the
get_state_synchronize_sched() and cond_synchronize_sched(), which provide
the same capability for RCU-sched.

Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# cdacbe1f 11-Jul-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Add fastpath bypassing funnel locking

In the common case, there will be only one expedited grace period in
the system at a given time, in which case it is not helpful to use
funnel locking. This commit therefore adds a fastpath that bypasses
funnel locking when the root ->exp_funnel_mutex is not held.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 32bb1c79 02-Jul-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Rename RCU_GP_DONE_FQS to RCU_GP_DOING_FQS

The grace-period kthread sleeps waiting to do a force-quiescent-state
scan, and when awakened sets rsp->gp_state to RCU_GP_DONE_FQS.
However, this is confusing because the kthread has not done the
force-quiescent-state, but is instead just starting to do it. This commit
therefore renames RCU_GP_DONE_FQS to RCU_GP_DOING_FQS in order to make
things a bit easier on reviewers.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b9a425cf 01-Jul-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Pull out wait_event*() condition into helper function

The condition for the wait_event_interruptible_timeout() that waits
to do the next force-quiescent-state scan is a bit ornate:

((gf = READ_ONCE(rsp->gp_flags)) &
RCU_GP_FLAG_FQS) ||
(!READ_ONCE(rnp->qsmask) &&
!rcu_preempt_blocked_readers_cgp(rnp))

This commit therefore pulls this condition out into a helper function
and comments its component conditions.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# cf3620a6 30-Jun-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Add stall warnings to synchronize_sched_expedited()

Although synchronize_sched_expedited() historically has no RCU CPU stall
warnings, the availability of the rcupdate.rcu_expedited boot parameter
invalidates the old assumption that synchronize_sched()'s stall warnings
would suffice. This commit therefore adds RCU CPU stall warnings to
synchronize_sched_expedited().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 2cd6ffaf 29-Jun-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Extend expedited funnel locking to rcu_data structure

The strictly rcu_node based funnel-locking scheme works well in many
cases, but systems with CONFIG_RCU_FANOUT_LEAF=64 won't necessarily get
all that much concurrency. This commit therefore extends the funnel
locking into the per-CPU rcu_data structure, providing concurrency equal
to the number of CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 704dd435 27-Jun-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Consolidate last open-coded expedited memory barrier

One of the requirements on RCU grace periods is that if there is a
causal chain of operations that starts after one grace period and
ends before another grace period, then the two grace periods must
be serialized. There has been (and might still be) code that relies
on this, for example, certain types of reference-counting code that
does a call_rcu() within an RCU callback function.

This requirement is why there is an smp_mb() at the end of both
synchronize_sched_expedited() and synchronize_rcu_expedited().
However, this is the only smp_mb() in these functions, so it would
be nicer to consolidate it into rcu_exp_gp_seq_end(). This commit
does just that.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 4f525a52 26-Jun-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Apply rcu_seq operations to _rcu_barrier()

The rcu_seq operations were open-coded in _rcu_barrier(), so this commit
replaces the open-coding with the shiny new rcu_seq operations.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 29fd9309 25-Jun-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Use funnel locking for synchronize_rcu_expedited()'s polling loop

This commit gets rid of synchronize_rcu_expedited()'s mutex_trylock()
polling loop in favor of the funnel-locking scheme that was abstracted
from synchronize_sched_expedited().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 7fd0ddc5 25-Jun-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix synchronize_sched_expedited() type error for "s"

The type of "s" has been "long" rather than the correct "unsigned long"
for quite some time. This commit fixes this type error.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b09e5f86 25-Jun-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Abstract funnel locking from synchronize_sched_expedited()

This commit abstracts funnel locking from synchronize_sched_expedited()
so that it may be used by synchronize_rcu_expedited().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 28f00767 25-Jun-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Abstract sequence counting from synchronize_sched_expedited()

This commit creates rcu_exp_gp_seq_start() and rcu_exp_gp_seq_end() to
bracket an expedited grace period, rcu_exp_gp_seq_snap() to snapshot the
sequence counter, and rcu_exp_gp_seq_done() to check to see if a full
expedited grace period has elapsed since the snapshot. These will be
applied to synchronize_rcu_expedited(). These are defined in terms of
underlying rcu_seq_start(), rcu_seq_end(), rcu_seq_snap(), rcu_seq_done(),
which will be applied to _rcu_barrier().

One reason that this commit doesn't use the seqcount primitives themselves
is that the smp_wmb() in those primitive is insufficient due to the fact
that expedited grace periods do reads as well as writes. In addition,
the read-side seqcount primitives detect a potentially partial change,
where the expedited primitives instead need a guaranteed full change.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 3a6d7c64 25-Jun-2015 Peter Zijlstra <peterz@infradead.org>

rcu: Make expedited GP CPU stoppage asynchronous

Sequentially stopping the CPUs slows down expedited grace periods by
at least a factor of two, based on rcutorture's grace-period-per-second
rate. This is a conservative measure because rcutorture uses unusually
long RCU read-side critical sections and because rcutorture periodically
quiesces the system in order to test RCU's ability to ramp down to and
up from the idle state. This commit therefore replaces the stop_one_cpu()
with stop_one_cpu_nowait(), using an atomic-counter scheme to determine
when all CPUs have passed through the stopped state.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 385b73c0 24-Jun-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Get rid of synchronize_sched_expedited()'s polling loop

This commit gets rid of synchronize_sched_expedited()'s mutex_trylock()
polling loop in favor of a funnel-locking scheme based on the rcu_node
tree. The work-done check is done at each level of the tree, allowing
high-contention situations to be resolved quickly with reasonable levels
of mutex contention.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d6ada2cf 24-Jun-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Rework synchronize_sched_expedited() counter handling

Now that synchronize_sched_expedited() have a mutex, it can use simpler
work-already-done detection scheme. This commit simplifies this scheme
by using something similar to the sequence-locking counter scheme.
A counter is incremented before and after each grace period, so that
the counter is odd in the midst of the grace period and even otherwise.
So if the counter has advanced to the second even number that is
greater than or equal to the snapshot, the required grace period has
already happened.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# c190c3b1 23-Jun-2015 Peter Zijlstra <peterz@infradead.org>

rcu: Switch synchronize_sched_expedited() to stop_one_cpu()

The synchronize_sched_expedited() currently invokes try_stop_cpus(),
which schedules the stopper kthreads on each online non-idle CPU,
and waits until all those kthreads are running before letting any
of them stop. This is disastrous for real-time workloads, which
get hit with a preemption that is as long as the longest scheduling
latency on any CPU, including any non-realtime housekeeping CPUs.
This commit therefore switches to using stop_one_cpu() on each CPU
in turn. This avoids inflicting the worst-case scheduling latency
on the worst-case CPU onto all other CPUs, and also simplifies the
code a little bit.

Follow-up commits will simplify the counter-snapshotting algorithm
and convert a number of the counters that are now protected by the
new ->expedited_mutex to non-atomic.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[ paulmck: Kept stop_one_cpu(), dropped disabling of "guardrails". ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 13bd6494 04-Jun-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Reset rcu_fanout_leaf if out of bounds

Currently if the rcu_fanout_leaf boot parameter is out of bounds (that
is, less than RCU_FANOUT_LEAF or greater than the number of bits in an
unsigned long), a warning is issued and execution continues with the
out-of-bounds value. This can result in all manner of failures, so this
patch resets rcu_fanout_leaf to RCU_FANOUT_LEAF when an out-of-bounds
condition is detected.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# cb007102 03-Jun-2015 Alexander Gordeev <agordeev@redhat.com>

rcu: Limit count of static data to the number of RCU levels

Although a number of RCU levels may be less than the current
maximum of four, some static data associated with each level
are allocated for all four levels. As result, the extra data
never get accessed and just wast memory. This update limits
count of allocated items to the number of used RCU levels.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 199977bf 03-Jun-2015 Alexander Gordeev <agordeev@redhat.com>

rcu: Remove unnecessary fields from rcu_state structure

Members rcu_state::levelcnt[] and rcu_state::levelspread[]
are only used at init. There is no reason to keep them
afterwards.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 05b84aec 03-Jun-2015 Alexander Gordeev <agordeev@redhat.com>

rcu: Limit rcu_capacity[] size to RCU_NUM_LVLS items

Number of items in rcu_capacity[] array is defined by macro
MAX_RCU_LVLS. However, that array is never accessed beyond
RCU_NUM_LVLS index. Therefore, we can limit the array to
RCU_NUM_LVLS items and eliminate MAX_RCU_LVLS. As result,
in most cases the memory is conserved.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 9618138b 03-Jun-2015 Alexander Gordeev <agordeev@redhat.com>

rcu: Simplify rcu_init_geometry() capacity arithmetics

Current code suggests that introducing the extra level to
rcu_capacity[] array makes some of the arithmetic easier.
Well, in fact it appears rather confusing and unnecessary.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 679f9858 03-Jun-2015 Alexander Gordeev <agordeev@redhat.com>

rcu: Cleanup rcu_init_geometry() code and arithmetics

This update simplifies rcu_init_geometry() code flow
and makes calculation of the total number of rcu_node
structures more easy to read.

The update relies on the fact num_rcu_lvl[] is never
accessed beyond rcu_num_lvls index by the rest of the
code. Therefore, there is no need initialize the whole
num_rcu_lvl[].

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 372b0ec2 03-Jun-2015 Alexander Gordeev <agordeev@redhat.com>

rcu: Remove superfluous local variable in rcu_init_geometry()

Local variable 'n' mimics 'nr_cpu_ids' while the both are
used within one function. There is no reason for 'n' to
exist whatsoever.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 75cf15a4 03-Jun-2015 Alexander Gordeev <agordeev@redhat.com>

rcu: Panic if RCU tree can not accommodate all CPUs

Currently a condition when RCU tree is unable to accommodate
the configured number of CPUs is not permitted and causes
a fall back to compile-time values. However, the code has no
means to exceed the RCU tree capacity neither at compile-time
nor in run-time. Therefore, if the condition is met in run-
time then it indicates a serios problem elsewhere and should
be handled with a panic.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 319362c9 19-May-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Provide more diagnostics for stalled GP kthread

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d1ec4c34 13-May-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Drop RCU_USER_QS in favor of NO_HZ_FULL

The RCU_USER_QS Kconfig parameter is now just a synonym for NO_HZ_FULL,
so this commit eliminates RCU_USER_QS, replacing all uses with NO_HZ_FULL.

Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>


# 1ce46ee5 06-May-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Conditionally compile RCU's eqs warnings

This commit applies some warning-omission micro-optimizations to RCU's
various extended-quiescent-state functions, which are on the kernel/user
hotpath for CONFIG_NO_HZ_FULL=y.

Reported-by: Rik van Riel <riel@redhat.com>
Reported by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 26730f55 21-Apr-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Make RCU able to tolerate undefined CONFIG_RCU_KTHREAD_PRIO

This commit updates the initialization of the kthread_prio boot parameter
so that RCU will build even when CONFIG_RCU_KTHREAD_PRIO is undefined.
The kthread_prio boot parameter is set to CONFIG_RCU_KTHREAD_PRIO if
that is defined, otherwise to 1 if CONFIG_RCU_BOOST is defined and
to zero otherwise. This commit then makes CONFIG_RCU_KTHREAD_PRIO
depend on CONFIG_RCU_EXPERT, so that Kconfig users won't be asked about
CONFIG_RCU_KTHREAD_PRIO unless they want to be.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>


# 47d631af 21-Apr-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Make RCU able to tolerate undefined CONFIG_RCU_FANOUT_LEAF

This commit introduces an RCU_FANOUT_LEAF C-preprocessor macro so
that RCU will build even when CONFIG_RCU_FANOUT_LEAF is undefined.
The RCU_FANOUT_LEAF macro is set to the value of CONFIG_RCU_FANOUT_LEAF
when defined, otherwise it is set to 32 for 32-bit systems and 64 for
64-bit systems. This commit then makes CONFIG_RCU_FANOUT_LEAF depend
on CONFIG_RCU_EXPERT, so that Kconfig users won't be asked about
CONFIG_RCU_FANOUT_LEAF unless they want to be.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>


# 05c5df31 20-Apr-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Make RCU able to tolerate undefined CONFIG_RCU_FANOUT

This commit introduces an RCU_FANOUT C-preprocessor macro so that RCU will
build even when CONFIG_RCU_FANOUT is undefined. The RCU_FANOUT macro is
set to the value of CONFIG_RCU_FANOUT when defined, otherwise it is set
to 32 for 32-bit systems and 64 for 64-bit systems. This commit then
makes CONFIG_RCU_FANOUT depend on CONFIG_RCU_EXPERT, so that Kconfig
users won't be asked about CONFIG_RCU_FANOUT unless they want to be.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>


# a3dc2948 20-Apr-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Enable diagnostic dump of rcu_node combining tree

The purpose of this commit is to make it easier to verify that RCU's
combining tree is set up correctly, which is useful to have when making
changes in how that tree is initialized.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
[ paulmck: Fold fix found by Fengguang's 0-day test robot. ]


# 7fa27001 20-Apr-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Convert CONFIG_RCU_FANOUT_EXACT to boot parameter

The CONFIG_RCU_FANOUT_EXACT Kconfig parameter is used primarily (and
perhaps only) by rcutorture to verify that RCU works correctly in specific
rcu_node combining-tree configurations. It therefore does not make
much sense have this as a question to people attempting to configure
their kernels. So this commit creates an rcutree.rcu_fanout_exact=
boot parameter that rcutorture can use, and eliminates the original
CONFIG_RCU_FANOUT_EXACT Kconfig parameter.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>


# 0f41c0dd 10-Mar-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Provide diagnostic option to slow down grace-period scans

Grace-period scans of the rcu_node combining tree normally
proceed quite quickly, so that it is very difficult to reproduce
races against them. This commit therefore allows grace-period
pre-initialization and cleanup to be artificially slowed down,
increasing race-reproduction probability. A pair of pairs of new
Kconfig parameters are provided, RCU_TORTURE_TEST_SLOW_PREINIT to
enable the slowing down of propagating CPU-hotplug changes up the
combining tree along with RCU_TORTURE_TEST_SLOW_PREINIT_DELAY to
specify the delay in jiffies, and RCU_TORTURE_TEST_SLOW_CLEANUP
to enable the slowing down of the end-of-grace-period cleanup scan
along with RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY to specify the delay
in jiffies. Boot-time parameters named rcutree.gp_preinit_delay and
rcutree.gp_cleanup_delay allow these delays to be specified at boot time.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 3eaaaf6c 09-Mar-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Shut up spurious gcc uninitialized-variable warning

Because gcc doesn't realize that rcu_num_lvls must be strictly greater
than zero, some versions give a spurious warning about levelcnt[0] being
uninitialized in rcu_init_one(). This commit updates the condition on
the pre-existing panic() in order to educate gcc on this point.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# eab128e8 15-Apr-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Modulate grace-period slow init to normalize delay

Currently, the larger the gp_init_delay boot parameter, the slower
rcutorture will sequence through grace periods. This commit avoids this
issue by decreasing the probability of slowing initialization of a given
grace period as the degree of slowness increases.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a738eec6 10-Mar-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Correctly initialize ->rcu_qs_ctr_snap at online time

The rcu_data structure's ->rcu_qs_ctr_snap field is initialized at
CPU-online time from the current CPU's element of the per-CPU rcu_qs_ctr
variable. Unfortunately, this is at CPU_UP_PREPARE time, so has nothing
to do with the CPU being onlined. This commit therefore initializes
this variable from the incoming CPU's element of rcu_qs_ctr.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# cce7f1fc 09-Mar-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove redundant offline check

Because offline CPUs are propagated up the rcu_node tree's ->qsmaskinit
bits just before each grace period starts, the ->qsmaskinit bit cannot
be clear when the corresponding ->qsmask bit is set. Furthermore, this
condition used to correspond to a CPU that was on its way offline, and
making RCU's notion of an offline CPU more precise has eliminated this
situation. This commit therefore removes the now-redundant offline
check from force_qs_rnp().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# c5b55395 09-Mar-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove dead code from force_qs_rnp()

Because force_qs_rnp() is invoked only from the force-quiescent-state
code which runs only in the context of the grace-period kthread, a grace
period must always be in progress throughout force_qs_rnp()'s execution.
This commit therefore removes the rcu_gp_in_progress() check and the
associated dead code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# ea46351c 03-Mar-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate HOTPLUG_CPU #ifdef in favor of IS_ENABLED()

This commit removes a HOTPLUG_CPU #ifdef, replacing it with
IS_ENABLED()-protected return statements. This relies on the
optimizer to remove any resulting dead code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 82072c4f 11-May-2015 Nicholas Mc Guire <hofrat@osadl.org>

rcu: Change function declaration to bool

rcu_cpu_has_callbacks() is declared int. The current declaration was introduced
in commit c0f4dfd4f90f (rcu: Make RCU_FAST_NO_HZ take advantage of numbered
callbacks). But it is actually returning bool and as the function description
states " * Return true if the specified CPU has any callback....", this probably
should be a bool as all (3) call-sites currently treat it as bool.

Type-checking coccinelle spatches are being used to locate type mismatches
between function signatures and return values in this case this produced:
./kernel/rcu/tree.c:3538 WARNING: return of wrong type
int != bool,

Patch was compile tested with x86_64_defconfig (implies CONFIG_TREE_RCU=y)

Patch is against 4.1-rc3 (localversion-next is -next-20150511) and fixes

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# c92fb057 05-May-2015 Nicolas Iooss <nicolas.iooss_linux@m4x.org>

rcu: Make rcu_*_data variables static

rcu_bh_data, rcu_sched_data and rcu_preempt_data are never used outside
kernel/rcu/tree.c and thus can be made static.

Doing so fixes a section mismatch warning reported by clang when
building LLVMLinux with -Wsection, because these variables were declared
in .data..percpu and defined in .data..percpu..shared_aligned since
commit 11bbb235c26f ("rcu: Use DEFINE_PER_CPU_SHARED_ALIGNED for
rcu_data").

Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 30ff1533 01-May-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Make synchronize_sched_expedited() call wait_rcu_gp()

Currently, synchronize_sched_expedited() will call synchronize_sched()
if there is danger of counter wrap. But if configuration says to
always do expedited grace periods, synchronize_sched() will just
call synchronize_sched_expedited() right back again. In theory,
the old expedited operations will complete, the counters will
get back in synch, and the recursion will end. But we could
easily run out of stack long before that time. This commit
therefore makes synchronize_sched_expedited() invoke the underlying
wait_rcu_gp(call_rcu_sched) instead of synchronize_sched(), the same as
all the other calls out from synchronize_sched_expedited().

This bug was introduced by commit 1924bcb02597 (Avoid counter wrap in
synchronize_sched_expedited()).

Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 81e701e43 16-Apr-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Add more debug info on "kthread starved" RCU CPU stall warnings

This commit adds grace number and command-flags information to the
"kthread starved" message that is sometimes printed out as part of
RCU CPU stall warnings. This message is caused by the corresponding
RCU grace-period kthread not having run for at least two seconds, and
this added information can be helpful when debugging.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# cd73ca21 16-Mar-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Force wakeup of rcu_gp_kthread at grace-period end

The rcu_gp_kthread_wake() refuses to do a wakeup unless at least
one of the ->gp_flags bits are set, which normally will not be the
case when the last quiescent state is reported. This results in
up to a 3-jiffy delay given default Kconfig settings. This commit
therefore has rcu_report_qs_rsp() set RCU_GP_FLAG_FQS before invoking
rcu_gp_kthread_wake() in order to force a more immediate wakeup at
grace-period end, thus reducing grace-period latencies.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 2927a689 04-Mar-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Create an immutable rcu_data_p pointer to default rcu_data structure

This commit creates an immutable rcu_data_p pointer that references
rcu_preempt_data for TREE_PREEMPT_RCU builds and that references
rcu_sched_data for TREE_RCU builds. This rcu_data_p pointer will enable
more code to move from #ifdef to IS_ENABLED().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b28a7c01 04-Mar-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Tell the compiler that rcu_state_p is immutable

This commit adds a "const" tag to the declarations of rcu_state_p,
which should allow the compiler to generate better code and also to
catch erroneous assignments to this variable.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 7d0ae808 03-Mar-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Convert ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE()

This commit moves from the old ACCESS_ONCE() API to the new READ_ONCE()
and WRITE_ONCE() APIs.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Updated to include kernel/torture.c as suggested by Jason Low. ]


# af658dca 29-Apr-2015 Steven Rostedt (Red Hat) <rostedt@goodmis.org>

tracing: Rename ftrace_event.h to trace_events.h

The term "ftrace" is really the infrastructure of the function hooks,
and not the trace events. Rename ftrace_event.h to trace_events.h to
represent the trace_event infrastructure and decouple the term ftrace
from it.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 8d7dc928 14-Apr-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Control grace-period delays directly from value

In a misguided attempt to avoid an #ifdef, the use of the
gp_init_delay module parameter was conditioned on the corresponding
RCU_TORTURE_TEST_SLOW_INIT Kconfig variable, using IS_ENABLED() at
the point of use in the code. This meant that the compiler always saw
the delay, which meant that RCU_TORTURE_TEST_SLOW_INIT_DELAY had to be
unconditionally defined. This in turn caused "make oldconfig" to ask
pointless questions about the value of RCU_TORTURE_TEST_SLOW_INIT_DELAY
in cases where it was not even used.

This commit avoids these pointless questions by defining gp_init_delay
under #ifdef. In one branch, gp_init_delay is initialized to
RCU_TORTURE_TEST_SLOW_INIT_DELAY and is also a module parameter (thus
allowing boot-time modification), and in the other branch gp_init_delay
is a const variable initialized by default to zero.

This approach also simplifies the code at the delay point by eliminating
the IS_DEFINED(). Because gp_init_delay is constant zero in the no-delay
case intended for production use, the "gp_init_delay > 0" check causes
the delay to become dead code, as desired in this case. In addition,
this commit replaces magic constant "10" with the preprocessor variable
PER_RCU_NODE_PERIOD, which controls the number of grace periods that
are allowed to elapse at full speed before a delay is inserted.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 654e9533 15-Mar-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Associate quiescent-state reports with grace period

As noted in earlier commit logs, CPU hotplug operations running
concurrently with grace-period initialization can result in a given
leaf rcu_node structure having all CPUs offline and no blocked readers,
but with this rcu_node structure nevertheless blocking the current
grace period. Therefore, the quiescent-state forcing code now checks
for this situation and repairs it.

Unfortunately, this checking can result in false positives, for example,
when the last task has just removed itself from this leaf rcu_node
structure, but has not yet started clearing the ->qsmask bits further
up the structure. This means that the grace-period kthread (which
forces quiescent states) and some other task might be attempting to
concurrently clear these ->qsmask bits. This is usually not a problem:
One of these tasks will be the first to acquire the upper-level rcu_node
structure's lock and with therefore clear the bit, and the other task,
seeing the bit already cleared, will stop trying to clear bits.

Sadly, this means that the following unusual sequence of events -can-
result in a problem:

1. The grace-period kthread wins, and clears the ->qsmask bits.

2. This is the last thing blocking the current grace period, so
that the grace-period kthread clears ->qsmask bits all the way
to the root and finds that the root ->qsmask field is now zero.

3. Another grace period is required, so that the grace period kthread
initializes it, including setting all the needed qsmask bits.

4. The leaf rcu_node structure (the one that started this whole
mess) is blocking this new grace period, either because it
has at least one online CPU or because there is at least one
task that had blocked within an RCU read-side critical section
while running on one of this leaf rcu_node structure's CPUs.
(And yes, that CPU might well have gone offline before the
grace period in step (3) above started, which can mean that
there is a task on the leaf rcu_node structure's ->blkd_tasks
list, but ->qsmask equal to zero.)

5. The other kthread didn't get around to trying to clear the upper
level ->qsmask bits until all the above had happened. This means
that it now sees bits set in the upper-level ->qsmask field, so it
proceeds to clear them. Too bad that it is doing so on behalf of
a quiescent state that does not apply to the current grace period!

This sequence of events can result in the new grace period being too
short. It can also result in the new grace period ending before the
leaf rcu_node structure's ->qsmask bits have been cleared, which will
result in splats during initialization of the next grace period. In
addition, it can result in tasks blocking the new grace period still
being queued at the start of the next grace period, which will result
in other splats. Sasha's testing turned up another of these splats,
as did rcutorture testing. (And yes, rcutorture is being adjusted to
make these splats show up more quickly. Which probably is having the
undesirable side effect of making other problems show up less quickly.
Can't have everything!)

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # 4.0.x
Tested-by: Sasha Levin <sasha.levin@oracle.com>


# a77da14c 08-Mar-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Yet another fix for preemption and CPU hotplug

As noted earlier, the following sequence of events can occur when
running PREEMPT_RCU and HOTPLUG_CPU on a system with a multi-level
rcu_node combining tree:

1. A group of tasks block on CPUs corresponding to a given leaf
rcu_node structure while within RCU read-side critical sections.
2. All CPUs corrsponding to that rcu_node structure go offline.
3. The next grace period starts, but because there are still tasks
blocked, the upper-level bits corresponding to this leaf rcu_node
structure remain set.
4. All the tasks exit their RCU read-side critical sections and
remove themselves from the leaf rcu_node structure's list,
leaving it empty.
5. But because there now is code to check for this condition at
force-quiescent-state time, the upper bits are cleared and the
grace period completes.

However, there is another complication that can occur following step 4 above:

4a. The grace period starts, and the leaf rcu_node structure's
gp_tasks pointer is set to NULL because there are no tasks
blocked on this structure.
4b. One of the CPUs corresponding to the leaf rcu_node structure
comes back online.
4b. An endless stream of tasks are preempted within RCU read-side
critical sections on this CPU, such that the ->blkd_tasks
list is always non-empty.

The grace period will never end.

This commit therefore makes the force-quiescent-state processing check only
for absence of tasks blocking the current grace period rather than absence
of tasks altogether. This will cause a quiescent state to be reported if
the current leaf rcu_node structure is not blocking the current grace period
and its parent thinks that it is, regardless of how RCU managed to get
itself into this state.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # 4.0.x
Tested-by: Sasha Levin <sasha.levin@oracle.com>


# 5c60d25f 09-Feb-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Add diagnostics to grace-period cleanup

At grace-period initialization time, RCU checks that all quiescent
states were really reported for the previous grace period. Now that
grace-period cleanup has been split out of grace-period initialization,
this commit also performs those checks at grace-period cleanup time.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 88428cc5 28-Jan-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Handle outgoing CPUs on exit from idle loop

This commit informs RCU of an outgoing CPU just before that CPU invokes
arch_cpu_idle_dead() during its last pass through the idle loop (via a
new CPU_DYING_IDLE notifier value). This change means that RCU need not
deal with outgoing CPUs passing through the scheduler after informing
RCU that they are no longer online. Note that removing the CPU from
the rcu_node ->qsmaskinit bit masks is done at CPU_DYING_IDLE time,
and orphaning callbacks is still done at CPU_DEAD time, the reason being
that at CPU_DEAD time we have another CPU that can adopt them.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# c1990689 23-Jan-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate ->onoff_mutex from rcu_node structure

Because that RCU grace-period initialization need no longer exclude
CPU-hotplug operations, this commit eliminates the ->onoff_mutex and
its uses.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 0aa04b05 23-Jan-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Process offlining and onlining only at grace-period start

Races between CPU hotplug and grace periods can be difficult to resolve,
so the ->onoff_mutex is used to exclude the two events. Unfortunately,
this means that it is impossible for an outgoing CPU to perform the
last bits of its offlining from its last pass through the idle loop,
because sleeplocks cannot be acquired in that context.

This commit avoids these problems by buffering online and offline events
in a new ->qsmaskinitnext field in the leaf rcu_node structures. When a
grace period starts, the events accumulated in this mask are applied to
the ->qsmaskinit field, and, if needed, up the rcu_node tree. The special
case of all CPUs corresponding to a given leaf rcu_node structure being
offline while there are still elements in that structure's ->blkd_tasks
list is handled using a new ->wait_blkd_tasks field. In this case,
propagating the offline bits up the tree is deferred until the beginning
of the grace period after all of the tasks have exited their RCU read-side
critical sections and removed themselves from the list, at which point
the ->wait_blkd_tasks flag is cleared. If one of that leaf rcu_node
structure's CPUs comes back online before the list empties, then the
->wait_blkd_tasks flag is simply cleared.

This of course means that RCU's notion of which CPUs are offline can be
out of date. This is OK because RCU need only wait on CPUs that were
online at the time that the grace period started. In addition, RCU's
force-quiescent-state actions will handle the case where a CPU goes
offline after the grace period starts.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# cc99a310 23-Feb-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Move rcu_report_unblock_qs_rnp() to common code

The rcu_report_unblock_qs_rnp() function is invoked when the
last task blocking the current grace period exits its outermost
RCU read-side critical section. Previously, this was called only
from rcu_read_unlock_special(), and was therefore defined only when
CONFIG_RCU_PREEMPT=y. However, this function will be invoked even when
CONFIG_RCU_PREEMPT=n once CPU-hotplug operations are processed only at
the beginnings of RCU grace periods. The reason for this change is that
the last task on a given leaf rcu_node structure's ->blkd_tasks list
might well exit its RCU read-side critical section between the time that
recent CPU-hotplug operations were applied and when the new grace period
was initialized. This situation could result in RCU waiting forever on
that leaf rcu_node structure, because if all that structure's CPUs were
already offline, there would be no quiescent-state events to drive that
structure's part of the grace period.

This commit therefore moves rcu_report_unblock_qs_rnp() to common code
that is built unconditionally so that the quiescent-state-forcing code
can clean up after this situation, avoiding the grace-period stall.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 999c2863 31-Jan-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove event tracing from rcu_cpu_notify(), used by offline CPUs

Offline CPUs cannot safely invoke trace events, but such CPUs do execute
within rcu_cpu_notify(). Therefore, this commit removes the trace events
from rcu_cpu_notify(). These trace events are for utilization, against
which rcu_cpu_notify() execution time should be negligible.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 37745d28 22-Jan-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Provide diagnostic option to slow down grace-period initialization

Grace-period initialization normally proceeds quite quickly, so
that it is very difficult to reproduce races against grace-period
initialization. This commit therefore allows grace-period
initialization to be artificially slowed down, increasing
race-reproduction probability. A pair of new Kconfig parameters are
provided, CONFIG_RCU_TORTURE_TEST_SLOW_INIT to enable the slowdowns, and
CONFIG_RCU_TORTURE_TEST_SLOW_INIT_DELAY to specify the number of jiffies
of slowdown to apply. A boot-time parameter named rcutree.gp_init_delay
allows boot-time delay to be specified. By default, no delay will be
applied even if CONFIG_RCU_TORTURE_TEST_SLOW_INIT is set.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 237a0f21 22-Jan-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Detect stalls caused by failure to propagate up rcu_node tree

If all CPUs have passed through quiescent states, then stalls might be
due to starvation of the grace-period kthread or to failure to propagate
the quiescent states up the rcu_node combining tree. The current stall
warning messages do not differentiate, so this commit adds a printout
of the root rcu_node structure's ->qsmask field.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 78043c46 18-Jan-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Put all orphan-callback-related code under same comment

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b33078b6 16-Jan-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Consolidate offline-CPU callback initialization

Currently, both rcu_cleanup_dead_cpu() and rcu_send_cbs_to_orphanage()
initialize the outgoing CPU's callback list. However, only
rcu_cleanup_dead_cpu() invokes rcu_send_cbs_to_orphanage(), and
it does so unconditionally, which means that only one of these
initializations is required. This commit therefore consolidates the
callback-list initialization with the rest of the callback handling in
rcu_send_cbs_to_orphanage().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 9910affa 25-Feb-2015 Yao Dongdong <yaodongdong@huawei.com>

rcu: Remove redundant check of cpu_online()

Because invoke_cpu_core() checks whether the current CPU is online,
there is no need for __call_rcu_core() to redundantly check it.
There should not be any performance degradation because the called
function is visible to the compiler. This commit therefore removes
the redundant check.

Signed-off-by: Yao Dongdong <yaodongdong@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# e7580f33 24-Feb-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Get rcu_sched_force_quiescent_state() where it belongs

The very similar functions rcu_force_quiescent_state(),
rcu_bh_force_quiescent_state(), and rcu_sched_force_quiescent_state()
are supposed to be together, but have drifted apart. This commit
restores rcu_sched_force_quiescent_state() to its rightful place.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 66292405 19-Jan-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Use IS_ENABLED() to CONFIG_RCU_FANOUT_EXACT #ifdef

This commit uses IS_ENABLED() to remove the #ifdef from the
rcu_init_levelspread() functions. No effect on executable code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 47627678 19-Jan-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Move early boot callback tests earlier

Because callbacks can now be posted quite early in boot, move the
early boot callback tests to precede RCU initialization.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 34404ca8 19-Jan-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Move early-boot callbacks to no-CBs lists for no-CBs CPUs

When a CPU is first determined to be a no-CBs CPUs, this commit causes
any early boot callbacks to be moved to the no-CBs callback list,
allowing them to be invoked.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 5871968d 24-Feb-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Tighten up affinity and check for sysidle

If the RCU grace-period kthread invoking rcu_sysidle_check_cpu()
happens to be running on the tick_do_timer_cpu initially,
then rcu_bind_gp_kthread() won't bind it. This kthread might
then migrate before invoking rcu_gp_fqs(), which will trigger the
WARN_ON_ONCE() in rcu_sysidle_check_cpu(). This commit therefore makes
rcu_bind_gp_kthread() do the binding even if the kthread is currently
on the same CPU. Because this incurs added overhead, this commit also
causes each RCU grace-period kthread to invoke rcu_bind_gp_kthread()
once at boot rather than at the beginning of each grace period.
And as long as rcu_bind_gp_kthread() is being modified, this commit
eliminates its #ifdef.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 675da67f 23-Feb-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Fixes to NO_HZ_FULL sysidle accounting

On second and subsequent passes through quiescent-state forcing, the
isidle variable was initialized to false, which would prevent full sysidle
state from being reached if a grace period needed more than one round
of quiescent-state forcing (which most should not). However, the check
for offline CPUs in the quiescent-state forcing main loop had the wrong
sense, which could prevent CPUs from ever entering full sysidle state.

This commit fixes both of these bugs. Given that sysidle is not yet
wired up, this has no effect in old kernels, but might have proven
frustrating had anyone attempted to wire it up.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 5afff48b 18-Feb-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Update from rcu_expedited variable to rcu_gp_is_expedited()

This commit updates open-coded tests of the rcu_expedited variable
to instead use rcu_gp_is_expedited().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 1925d196 18-Jan-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix a couple of typos in rcu_all_qs() comment header

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 39c8d313 21-Jan-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Avoid clobbering early boot callbacks

When a CPU comes online, it initializes its callback list. This
is a bad thing if this is the first time that the CPU has come
online and if that CPU has early boot callbacks. This commit therefore
avoid initializing the callback list if there are callbacks present,
in which case the initial call_rcu() did the initialization for us.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 143da9c2 19-Jan-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Prevent early-boot RCU callbacks from splatting

Currently, a call_rcu() that precedes rcu_init() will splat due to the
callback lists not having yet been initialized. This commit causes the
first such callback to initialize the boot CPU's RCU callback list.

Note that this commit does not change rcu_init()-time initialization,
which means that the callback will be discarded at rcu_init() time.
Fixing this is the job of later commits.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 2723249a 20-Jan-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Wire ->rda pointers at compile time

This commit wires up the rcu_state structures' ->rda pointers to the
per-CPU rcu_data structures at compile time, thus ensuring that this
linkage is present at early boot, in turn allowing posting of callbacks
before rcu_init() is executed.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d3f3f3f2 18-Jan-2015 Paul E. McKenney <paulmck@kernel.org>

rcu: Abstract default callback-list initialization from init_callback_list()

In preparation for early-boot posting of callbacks, this commit abstracts
initialization of the default (non-no-CB) callbacks list from the
init_callback_list() function into a new init_default_callback_list()
function.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 3f47da0f 13-Jan-2015 Lai Jiangshan <laijs@cn.fujitsu.com>

rcu_tree: Avoid touching rnp->completed when a new GP is started

In rcu_gp_init(), rnp->completed equals to rsp->completed in THEORY,
we don't need to touch it normally. If something goes wrong,
it will complain and fixup rnp->completed and avoid oops.
This commit thus avoids the normal needless store to rnp->completed.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# fb81a44b 17-Dec-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Add GP-kthread-starvation checks to CPU stall warnings

This commit adds a message that is printed if the relevant grace-period
kthread has not been able to run for the two seconds preceding the
stall warning. (The two seconds is double the maximum interval between
successive bouts of quiescent-state forcing.)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 5cd37193 13-Dec-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Make cond_resched_rcu_qs() apply to normal RCU flavors

Although cond_resched_rcu_qs() only applies to TASKS_RCU, it is used
in places where it would be useful for it to apply to the normal RCU
flavors, rcu_preempt, rcu_sched, and rcu_bh. This is especially the
case for workloads that aggressively overload the system, particularly
those that generate large numbers of RCU updates on systems running
NO_HZ_FULL CPUs. This commit therefore communicates quiescent states
from cond_resched_rcu_qs() to the normal RCU flavors.

Note that it is unfortunately necessary to leave the old ->passed_quiesce
mechanism in place to allow quiescent states that apply to only one
flavor to be recorded. (Yes, we could decrement ->rcu_qs_ctr_snap in
that case, but that is not so good for debugging of RCU internals.)
In addition, if one of the RCU flavor's grace period has stalled, this
will invoke rcu_momentary_dyntick_idle(), resulting in a heavy-weight
quiescent state visible from other CPUs.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Merge commit from Sasha Levin fixing a bug where __this_cpu()
was used in preemptible code. ]


# a94844b2 12-Dec-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Optionally run grace-period kthreads at real-time priority

Recent testing has shown that under heavy load, running RCU's grace-period
kthreads at real-time priority can improve performance (according to 0day
test robot) and reduce the incidence of RCU CPU stall warnings. However,
most systems do just fine with the default non-realtime priorities for
these kthreads, and it does not make sense to expose the entire user
base to any risk stemming from this change, given that this change is
of use only to a few users running extremely heavy workloads.

Therefore, this commit allows users to specify realtime priorities
for the grace-period kthreads, but leaves them running SCHED_OTHER
by default. The realtime priority may be specified at build time
via the RCU_KTHREAD_PRIO Kconfig parameter, or at boot time via the
rcutree.kthread_prio parameter. Either way, 0 says to continue the
default SCHED_OTHER behavior and values from 1-99 specify that priority
of SCHED_FIFO behavior. Note that a value of 0 is not permitted when
the RCU_BOOST Kconfig parameter is specified.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 917963d0 21-Nov-2014 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Check from beginning to end of grace period

Currently, rcutorture's Reader Batch checks measure from the end of
the previous grace period to the end of the current one. This commit
tightens up these checks by measuring from the start and end of the same
grace period. This involves adding rcu_batches_started() and friends
corresponding to the existing rcu_batches_completed() and friends.

We leave SRCU alone for the moment, as it does not yet have a way of
tracking both ends of its grace periods.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 9733e4f0 21-Nov-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Make _batches_completed() functions return unsigned long

Long ago, the various ->completed fields were of type long, but now are
unsigned long due to signed-integer-overflow concerns. However, the
various _batches_completed() functions remained of type long, even though
their only purpose in life is to return the corresponding ->completed
field. This patch cleans this up by changing these functions' return
types to unsigned long.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# e3663b10 08-Dec-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Handle gpnum/completed wrap while dyntick idle

Subtle race conditions can result if a CPU stays in dyntick-idle mode
long enough for the ->gpnum and ->completed fields to wrap. For
example, consider the following sequence of events:

o CPU 1 encounters a quiescent state while waiting for grace period
5 to complete, but then enters dyntick-idle mode.

o While CPU 1 is in dyntick-idle mode, the grace-period counters
wrap around so that the grace period number is now 4.

o Just as CPU 1 exits dyntick-idle mode, grace period 4 completes
and grace period 5 begins.

o The quiescent state that CPU 1 passed through during the old
grace period 5 looks like it applies to the new grace period
5. Therefore, the new grace period 5 completes without CPU 1
having passed through a quiescent state.

This could clearly be a fatal surprise to any long-running RCU read-side
critical section that happened to be running on CPU 1 at the time. At one
time, this was not a problem, given that it takes significant time for
the grace-period counters to overflow even on 32-bit systems. However,
with the advent of NO_HZ_FULL and SMP embedded systems, arbitrarily long
idle periods are now becoming quite feasible. It is therefore time to
close this race.

This commit therefore avoids this race condition by having the
quiescent-state forcing code detect when a CPU is falling too far
behind, and setting a new rcu_data field ->gpwrap when this happens.
Whenever this new ->gpwrap field is set, the CPU's ->gpnum and ->completed
fields are known to be untrustworthy, and can be ignored, along with
any associated quiescent states.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 6ccd2ecd 11-Dec-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Improve diagnostics for spurious RCU CPU stall warnings

The current RCU CPU stall warning code will print "Stall ended before
state dump start" any time that the stall-warning code is triggered on
a CPU that has already reported a quiescent state for the current grace
period and if all quiescent states have been reported for the current
grace period. However, a true stall can result in these symptoms, for
example, by preventing RCU's grace-period kthreads from ever running

This commit therefore checks for this condition, reporting the end of
the stall only if one of the grace-period counters has actually advanced.
Otherwise, it reports the last time that the grace-period kthread made
meaningful progress. (In normal situations, the grace-period kthread
should make meaningful progress at least every jiffies_till_next_fqs
jiffies.)

Reported-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Miroslav Benes <mbenes@suse.cz>


# fc908ed3 08-Dec-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Make RCU_CPU_STALL_INFO include number of fqs attempts

One way that an RCU CPU stall warning can happen is if the grace-period
kthread is not allowed to execute. One proxy for this kthread's
forward progress is the number of force-quiescent-state (fqs) scans.
This commit therefore adds the number of fqs scans to the RCU CPU stall
warning printouts when CONFIG_RCU_CPU_STALL_INFO=y.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# ab954c16 17-Dec-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove redundant callback-list initialization

The RCU callback lists are initialized in both rcu_boot_init_percpu_data()
and rcu_init_percpu_data(). The former is intended for initializing
immutable data, so this commit removes the initialization from
rcu_boot_init_percpu_data() and leaves it in rcu_init_percpu_data().
This change prepares for permitting callbacks to be queued very early
in boot.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 6cd534ef 10-Dec-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Don't scan root rcu_node structure for stalled tasks

Now that blocked tasks are no longer migrated to the root rcu_node
structure, there is no need to scan the root rcu_node structure for
blocked tasks stalling the current grace period. This commit therefore
removes this scan.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 3ba4d0e0 12-Nov-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Note quiescent state when CPU goes offline

The rcu_cleanup_dead_cpu() function (called after a CPU has gone
completely offline) has not reported a quiescent state because there
was probably at least one synchronize_rcu() between the time the CPU
went offline and the CPU_DEAD notifier, and this would have detected
the CPU's offline state via quiescent-state forcing. However, the plan
is for CPUs to take themselves offline, at which point it makes sense
for them to report their own quiescent state. This commit makes this
change in preparation for the new CPU-hotplug setup.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 1be0085b 03-Nov-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Don't initiate RCU priority boosting on root rcu_node

Because there is no longer any preempted tasks on the root rcu_node, and
because there is no longer ever an rcub kthread for the root rcu_node,
this commit drops the code in force_qs_rnp() that attempts to awaken
the non-existent root rcub kthread. This is strictly a performance
enhancement, removing a root rcu_node ->lock acquisition and release
along with some tests in rcu_initiate_boost(), ending with the test that
notes that there is no rcub kthread.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a8f4cbad 31-Oct-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Shorten irq-disable region in rcu_cleanup_dead_cpu()

Now that we are not migrating callbacks, there is no need to hold the
->orphan_lock across the the ->qsmaskinit bit-clearing process.
This commit therefore releases ->orphan_lock immediately after adopting
the orphaned RCU callbacks.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d19fb8d1 31-Oct-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Don't migrate blocked tasks even if all corresponding CPUs offline

When the last CPU associated with a given leaf rcu_node structure
goes offline, something must be done about the tasks queued on that
rcu_node structure. Each of these tasks has been preempted on one of
the leaf rcu_node structure's CPUs while in an RCU read-side critical
section that it have not yet exited. Handling these tasks is the job of
rcu_preempt_offline_tasks(), which migrates them from the leaf rcu_node
structure to the root rcu_node structure.

Unfortunately, this migration has to be done one task at a time because
each tasks allegiance must be shifted from the original leaf rcu_node to
the root, so that future attempts to deal with these tasks will acquire
the root rcu_node structure's ->lock rather than that of the leaf.
Worse yet, this migration must be done with interrupts disabled, which
is not so good for realtime response, especially given that there is
no bound on the number of tasks on a given rcu_node structure's list.
(OK, OK, there is a bound, it is just that it is unreasonably large,
especially on 64-bit systems.) This was not considered a problem back
when rcu_preempt_offline_tasks() was first written because realtime
systems were assumed not to do CPU-hotplug operations while real-time
applications were running. This assumption has proved of dubious validity
given that people are starting to run multiple realtime applications
on a single SMP system and that it is common practice to offline then
online a CPU before starting its real-time application in order to clear
extraneous processing off of that CPU. So we now need CPU hotplug
operations to avoid undue latencies.

This commit therefore avoids migrating these tasks, instead letting
them be dequeued one by one from the original leaf rcu_node structure
by rcu_read_unlock_special(). This means that the clearing of bits
from the upper-level rcu_node structures must be deferred until the
last such task has been dequeued, because otherwise subsequent grace
periods won't wait on them. This commit has the beneficial side effect
of simplifying the CPU-hotplug code for TREE_PREEMPT_RCU, especially in
CONFIG_RCU_BOOST builds.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b6a932d1 31-Oct-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_read_unlock_special() propagate ->qsmaskinit bit clearing

This commit causes rcu_read_unlock_special() to propagate ->qsmaskinit
bit clearing up the rcu_node tree once a given rcu_node structure's
blkd_tasks list becomes empty. This is the final commit in preparation
for the rework of RCU priority boosting: It enables preempted tasks to
remain queued on their rcu_node structure even after all of that rcu_node
structure's CPUs have gone offline.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 8af3a5e7 31-Oct-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Abstract rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu()

This commit abstracts rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu()
in preparation for the rework of RCU priority boosting. This new function
will be invoked from rcu_read_unlock_special() in the reworked scheme,
which is why rcu_cleanup_dead_rnp() assumes that the leaf rcu_node
structure's ->qsmaskinit field has already been updated.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 41050a00 18-Dec-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix rcu_barrier() race that could result in too-short wait

The rcu_barrier() no-callbacks check for no-CBs CPUs has race conditions.
It checks a given CPU's lists of callbacks, and if all three no-CBs lists
are empty, ignores that CPU. However, these three lists could potentially
be empty even when callbacks are present if the check executed just as
the callbacks were being moved from one list to another. It turns out
that recent versions of rcutorture can spot this race.

This commit plugs this hole by consolidating the per-list counts of
no-CBs callbacks into a single count, which is incremented before
the corresponding callback is posted and after it is invoked. Then
rcu_barrier() checks this single count to reliably determine whether
the corresponding CPU has no-CBs callbacks.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 5f6130fa 09-Dec-2014 Lai Jiangshan <laijs@cn.fujitsu.com>

tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task

For RCU in UP, context-switch = QS = GP, thus we can force a
context-switch when any call_rcu_[bh|sched]() is happened on idle_task.
After doing so, rcu_idle/irq_enter/exit() are useless, so we can simply
make these functions empty.

More important, this change does not change the functionality logically.
Note: raise_softirq(RCU_SOFTIRQ)/rcu_sched_qs() in rcu_idle_enter() and
outmost rcu_irq_exit() will have to wake up the ksoftirqd
(due to in_interrupt() == 0).

Before this patch After this patch:
call_rcu_sched() in idle; call_rcu_sched() in idle
set resched
do other stuffs; do other stuffs
outmost rcu_irq_exit() outmost rcu_irq_exit() (empty function)
(or rcu_idle_enter()) (or rcu_idle_enter(), also empty function)
start to resched. (see above)
rcu_sched_qs() rcu_sched_qs()
QS,and GP and advance cb QS,and GP and advance cb
wake up the ksoftirqd wake up the ksoftirqd
set resched
resched to ksoftirqd (or other) resched to ksoftirqd (or other)

These two code patches are almost the same.

Size changed after patched:

size kernel/rcu/tiny-old.o kernel/rcu/tiny-patched.o
text data bss dec hex filename
3449 206 8 3663 e4f kernel/rcu/tiny-old.o
2406 144 8 2558 9fe kernel/rcu/tiny-patched.o

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 924df8a0 29-Oct-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix invoke_rcu_callbacks() comment

Despite what the comment says, it is only softirqs that are disabled,
not interrupts. This commit therefore fixes the comment.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 734d1680 21-Nov-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_nmi_enter() handle nesting

The x86 architecture has multiple types of NMI-like interrupts: real
NMIs, machine checks, and, for some values of NMI-like, debugging
and breakpoint interrupts. These interrupts can nest inside each
other. Andy Lutomirski is adding RCU support to these interrupts,
so rcu_nmi_enter() and rcu_nmi_exit() must now correctly handle nesting.

This commit therefore introduces nesting, using a clever NMI-coordination
algorithm suggested by Andy. The trick is to atomically increment
->dynticks (if needed) before manipulating ->dynticks_nmi_nesting on entry
(and, accordingly, after on exit). In addition, ->dynticks_nmi_nesting
is incremented by one if ->dynticks was incremented and by two otherwise.
This means that when rcu_nmi_exit() sees ->dynticks_nmi_nesting equal
to one, it knows that ->dynticks must be atomically incremented.

This NMI-coordination algorithms has been validated by the following
Promela model:

------------------------------------------------------------------------

/*
* Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
* that allows nesting.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, you can access it online at
* http://www.gnu.org/licenses/gpl-2.0.html.
*
* Copyright IBM Corporation, 2014
*
* Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
*/

byte dynticks_nmi_nesting = 0;
byte dynticks = 0;

/*
* Promela verision of rcu_nmi_enter().
*/
inline rcu_nmi_enter()
{
byte incby;
byte tmp;

incby = BUSY_INCBY;
assert(dynticks_nmi_nesting >= 0);
if
:: (dynticks & 1) == 0 ->
atomic {
dynticks = dynticks + 1;
}
assert((dynticks & 1) == 1);
incby = 1;
:: else ->
skip;
fi;
tmp = dynticks_nmi_nesting;
tmp = tmp + incby;
dynticks_nmi_nesting = tmp;
assert(dynticks_nmi_nesting >= 1);
}

/*
* Promela verision of rcu_nmi_exit().
*/
inline rcu_nmi_exit()
{
byte tmp;

assert(dynticks_nmi_nesting > 0);
assert((dynticks & 1) != 0);
if
:: dynticks_nmi_nesting != 1 ->
tmp = dynticks_nmi_nesting;
tmp = tmp - BUSY_INCBY;
dynticks_nmi_nesting = tmp;
:: else ->
dynticks_nmi_nesting = 0;
atomic {
dynticks = dynticks + 1;
}
assert((dynticks & 1) == 0);
fi;
}

/*
* Base-level NMI runs non-atomically. Crudely emulates process-level
* dynticks-idle entry/exit.
*/
proctype base_NMI()
{
byte busy;

busy = 0;
do
:: /* Emulate base-level dynticks and not. */
if
:: 1 -> atomic {
dynticks = dynticks + 1;
}
busy = 1;
:: 1 -> skip;
fi;

/* Verify that we only sometimes have base-level dynticks. */
if
:: busy == 0 -> skip;
:: busy == 1 -> skip;
fi;

/* Model RCU's NMI entry and exit actions. */
rcu_nmi_enter();
assert((dynticks & 1) == 1);
rcu_nmi_exit();

/* Emulated re-entering base-level dynticks and not. */
if
:: !busy -> skip;
:: busy ->
atomic {
dynticks = dynticks + 1;
}
busy = 0;
fi;

/* We had better now be in dyntick-idle mode. */
assert((dynticks & 1) == 0);
od;
}

/*
* Nested NMI runs atomically to emulate interrupting base_level().
*/
proctype nested_NMI()
{
do
:: /*
* Use an atomic section to model a nested NMI. This is
* guaranteed to interleave into base_NMI() between a pair
* of base_NMI() statements, just as a nested NMI would.
*/
atomic {
/* Verify that we only sometimes are in dynticks. */
if
:: (dynticks & 1) == 0 -> skip;
:: (dynticks & 1) == 1 -> skip;
fi;

/* Model RCU's NMI entry and exit actions. */
rcu_nmi_enter();
assert((dynticks & 1) == 1);
rcu_nmi_exit();
}
od;
}

init {
run base_NMI();
run nested_NMI();
}

------------------------------------------------------------------------

The following script can be used to run this model if placed in
rcu_nmi.spin:

------------------------------------------------------------------------

if ! spin -a rcu_nmi.spin
then
echo Spin errors!!!
exit 1
fi
if ! cc -DSAFETY -o pan pan.c
then
echo Compilation errors!!!
exit 1
fi
./pan -m100000

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>


# aa23c6fbc 19-Sep-2014 Pranith Kumar <bobby.prani@gmail.com>

rcutorture: Add early boot self tests

Add early boot self tests for RCU under CONFIG_PROVE_RCU.

Currently the only test is adding a dummy callback which increments a counter
which we then later verify after calling rcu_barrier*().

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 8fa7845d 22-Oct-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove "cpu" argument to rcu_cleanup_after_idle()

The "cpu" argument to rcu_cleanup_after_idle() is always the current
CPU, so drop it. This moves the smp_processor_id() from the caller to
rcu_cleanup_after_idle(), saving argument-passing overhead. Again,
the anticipated cross-CPU uses of these functions has been replaced
by NO_HZ_FULL.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>


# 198bbf81 22-Oct-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove "cpu" argument to rcu_prepare_for_idle()

The "cpu" argument to rcu_prepare_for_idle() is always the current
CPU, so drop it. This in turn allows two of the uses of "cpu" in
this function to be replaced with a this_cpu_ptr() and the third by
smp_processor_id(), replacing that of the call to rcu_prepare_for_idle().
Again, the anticipated cross-CPU uses of these functions has been replaced
by NO_HZ_FULL.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>


# aa6da514 21-Oct-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove "cpu" argument to rcu_needs_cpu()

The "cpu" argument to rcu_needs_cpu() is always the current CPU, so drop
it. This in turn allows the "cpu" argument to rcu_cpu_has_callbacks()
to be removed, which allows the uses of "cpu" in both functions to be
replaced with a this_cpu_ptr(). Again, the anticipated cross-CPU uses
of these functions has been replaced by NO_HZ_FULL.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>


# 38200cf2 21-Oct-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove "cpu" argument to rcu_note_context_switch()

The "cpu" argument to rcu_note_context_switch() is always the current
CPU, so drop it. This in turn allows the "cpu" argument to
rcu_preempt_note_context_switch() to be removed, which allows the sole
use of "cpu" in both functions to be replaced with a this_cpu_ptr().
Again, the anticipated cross-CPU uses of these functions has been
replaced by NO_HZ_FULL.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>


# 86aea0e6 21-Oct-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove "cpu" argument to rcu_preempt_check_callbacks()

Because rcu_preempt_check_callbacks()'s argument is guaranteed to
always be the current CPU, drop the argument and replace per_cpu()
with __this_cpu_read().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>


# e3950ecd 21-Oct-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove "cpu" argument to rcu_pending()

Because rcu_pending()'s argument is guaranteed to always be the current
CPU, drop the argument and replace per_cpu_ptr() with this_cpu_ptr().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>


# c3377c2d 21-Oct-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove "cpu" argument to rcu_check_callbacks()

The "cpu" argument was kept around on the off-chance that RCU might
offload scheduler-clock interrupts. However, this offload approach
has been replaced by NO_HZ_FULL, which offloads -all- RCU processing
from qualifying CPUs. It is therefore time to remove the "cpu" argument
to rcu_check_callbacks(), which this commit does.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>


# 11bbb235 04-Sep-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data

The rcu_data per-CPU variable has a number of fields that are atomically
manipulated, potentially by any CPU. This situation can result in false
sharing with per-CPU variables that have the misfortune of being allocated
adjacent to rcu_data in memory. This commit therefore changes the
DEFINE_PER_CPU() to DEFINE_PER_CPU_SHARED_ALIGNED() in order to avoid
this false sharing.

Reported-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>


# 28ced795 02-Sep-2014 Christoph Lameter <cl@linux.com>

rcu: Remove rcu_dynticks * parameters when they are always this_cpu_ptr(&rcu_dynticks)

For some functions in kernel/rcu/tree* the rdtp parameter is always
this_cpu_ptr(rdtp). Remove the parameter if constant and calculate the
pointer in function.

This will have the advantage that it is obvious that the address are
all per cpu offsets and thus it will enable the use of this_cpu_ops in
the future.

Signed-off-by: Christoph Lameter <cl@linux.com>
[ paulmck: Forward-ported to rcu/dev, whitespace adjustment. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>


# 776d6807 23-Oct-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Kick rcuo kthreads after their CPU goes offline

If a no-CBs CPU were to post an RCU callback with interrupts disabled
after it entered the idle loop for the last time, there might be no
deferred wakeup for the corresponding rcuo kthreads. This commit
therefore adds a set of calls to do_nocb_deferred_wakeup() after the
CPU has gone completely offline.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# e0775cef 03-Sep-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Avoid IPIing idle CPUs from synchronize_sched_expedited()

Currently, synchronize_sched_expedited() sends IPIs to all online CPUs,
even those that are idle or executing in nohz_full= userspace. Because
idle CPUs and nohz_full= userspace CPUs are in extended quiescent states,
there is no need to IPI them in the first place. This commit therefore
avoids IPIing CPUs that are already in extended quiescent states.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 61cfd097 02-Sep-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Move RCU_BOOST variable declarations, eliminating #ifdef

There are some RCU_BOOST-specific per-CPU variable declarations that
are needlessly defined under #ifdef in kernel/rcu/tree.c. This commit
therefore moves these declarations into a pre-existing #ifdef in
kernel/rcu/tree_plugin.h.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# d7e29933 27-Oct-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Make rcu_barrier() understand about missing rcuo kthreads

Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
avoids creating rcuo kthreads for CPUs that never come online. This
fixes a bug in many instances of firmware: Instead of lying about their
age, these systems instead lie about the number of CPUs that they have.
Before commit 35ce7f29a44a, this could result in huge numbers of useless
rcuo kthreads being created.

It appears that experience indicates that I should have told the
people suffering from this problem to fix their broken firmware, but
I instead produced what turned out to be a partial fix. The missing
piece supplied by this commit makes sure that rcu_barrier() knows not to
post callbacks for no-CBs CPUs that have not yet come online, because
otherwise rcu_barrier() will hang on systems having firmware that lies
about the number of CPUs.

It is tempting to simply have rcu_barrier() refuse to post a callback on
any no-CBs CPU that does not have an rcuo kthread. This unfortunately
does not work because rcu_barrier() is required to wait for all pending
callbacks. It is therefore required to wait even for those callbacks
that cannot possibly be invoked. Even if doing so hangs the system.

Given that posting a callback to a no-CBs CPU that does not yet have an
rcuo kthread can hang rcu_barrier(), It is tempting to report an error
in this case. Unfortunately, this will result in false positives at
boot time, when it is perfectly legal to post callbacks to the boot CPU
before the scheduler has started, in other words, before it is legal
to invoke rcu_barrier().

So this commit instead has rcu_barrier() avoid posting callbacks to
CPUs having neither rcuo kthread nor pending callbacks, and has it
complain bitterly if it finds CPUs having no rcuo kthread but some
pending callbacks. And when rcu_barrier() does find CPUs having no rcuo
kthread but pending callbacks, as noted earlier, it has no choice but
to hang indefinitely.

Reported-by: Yanko Kaneti <yaneti@declera.com>
Reported-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Eric B Munson <emunson@akamai.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Eric B Munson <emunson@akamai.com>
Tested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Tested-by: Yanko Kaneti <yaneti@declera.com>
Tested-by: Kevin Fenzi <kevin@scrye.com>
Tested-by: Meelis Roos <mroos@linux.ee>


# dd56af42 25-Aug-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate deadlock between CPU hotplug and expedited grace periods

Currently, the expedited grace-period primitives do get_online_cpus().
This greatly simplifies their implementation, but means that calls
to them holding locks that are acquired by CPU-hotplug notifiers (to
say nothing of calls to these primitives from CPU-hotplug notifiers)
can deadlock. But this is starting to become inconvenient, as can be
seen here: https://lkml.org/lkml/2014/8/5/754. The problem in this
case is that some developers need to acquire a mutex from a CPU-hotplug
notifier, but also need to hold it across a synchronize_rcu_expedited().
As noted above, this currently results in deadlock.

This commit avoids the deadlock and retains the simplicity by creating
a try_get_online_cpus(), which returns false if the get_online_cpus()
reference count could not immediately be incremented. If a call to
try_get_online_cpus() returns true, the expedited primitives operate as
before. If a call returns false, the expedited primitives fall back to
normal grace-period operations. This falling back of course results in
increased grace-period latency, but only during times when CPU hotplug
operations are actually in flight. The effect should therefore be
negligible during normal operation.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Tested-by: Lan Tianyu <tianyu.lan@intel.com>


# 35ce7f29 11-Jul-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Create rcuo kthreads only for onlined CPUs

RCU currently uses for_each_possible_cpu() to spawn rcuo kthreads,
which can result in more rcuo kthreads than one would expect, for
example, derRichard reported 64 CPUs worth of rcuo kthreads on an
8-CPU image. This commit therefore creates rcuo kthreads only for
those CPUs that actually come online.

This was reported by derRichard on the OFTC IRC network.

Reported-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# 9386c0b7 13-Jul-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Rationalize kthread spawning

Currently, RCU spawns kthreads from several different early_initcall()
functions. Although this has served RCU well for quite some time,
as more kthreads are added a more deterministic approach is required.
This commit therefore causes all of RCU's early-boot kthreads to be
spawned from a single early_initcall() function.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# 284a8c93 14-Aug-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Per-CPU operation cleanups to rcu_*_qs() functions

The rcu_bh_qs(), rcu_preempt_qs(), and rcu_sched_qs() functions use
old-style per-CPU variable access and write to ->passed_quiesce even
if it is already set. This commit therefore updates to use the new-style
per-CPU variable access functions and avoids the spurious writes.
This commit also eliminates the "cpu" argument to these functions because
they are always invoked on the indicated CPU.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 176f8f7a 04-Aug-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Make TASKS_RCU handle nohz_full= CPUs

Currently TASKS_RCU would ignore a CPU running a task in nohz_full=
usermode execution. There would be neither a context switch nor a
scheduling-clock interrupt to tell TASKS_RCU that the task in question
had passed through a quiescent state. The grace period would therefore
extend indefinitely. This commit therefore makes RCU's dyntick-idle
subsystem record the task_struct structure of the task that is running
in dyntick-idle mode on each CPU. The TASKS_RCU grace period can
then access this information and record a quiescent state on
behalf of any CPU running in dyntick-idle usermode.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# bde6c3aa 01-Jul-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops

RCU-tasks requires the occasional voluntary context switch
from CPU-bound in-kernel tasks. In some cases, this requires
instrumenting cond_resched(). However, there is some reluctance
to countenance unconditionally instrumenting cond_resched() (see
http://lwn.net/Articles/603252/), so this commit creates a separate
cond_resched_rcu_qs() that may be used in place of cond_resched() in
locations prone to long-duration in-kernel looping.

This commit currently instruments only RCU-tasks. Future possibilities
include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
IPI usage.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 8315f422 27-Jun-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Add call_rcu_tasks()

This commit adds a new RCU-tasks flavor of RCU, which provides
call_rcu_tasks(). This RCU flavor's quiescent states are voluntary
context switch (not preemption!) and userspace execution (not the idle
loop -- use some sort of schedule_on_each_cpu() if you need to handle the
idle tasks. Note that unlike other RCU flavors, these quiescent states
occur in tasks, not necessarily CPUs. Includes fixes from Steven Rostedt.

This RCU flavor is assumed to have very infrequent latency-tolerant
updaters. This assumption permits significant simplifications, including
a single global callback list protected by a single global lock, along
with a single task-private linked list containing all tasks that have not
yet passed through a quiescent state. If experience shows this assumption
to be incorrect, the required additional complexity will be added.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 73a860cd 14-Aug-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Replace flush_signals() with WARN_ON(signal_pending())

Currently, when RCU awakens from a wait_event_interruptible() that
might have awakened prematurely, it does a flush_signals(). This is
done on the off-chance that someone figured out how to deliver a signal
to a kthread, which is supposed to be impossible. Given that this
is supposed to be impossible, this commit changes the flush_signals()
calls into WARN_ON(signal_pending()).

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 2aa792e6 12-Aug-2014 Pranith Kumar <bobby.prani@gmail.com>

rcu: Use rcu_gp_kthread_wake() to wake up grace period kthreads

The rcu_gp_kthread_wake() function checks for three conditions before
waking up grace period kthreads:

* Is the thread we are trying to wake up the current thread?
* Are the gp_flags zero? (all threads wait on non-zero gp_flags condition)
* Is there no thread created for this flavour, hence nothing to wake up?

If any one of these condition is true, we do not call wake_up().
It was found that there are quite a few avoidable wake ups both during
idle time and under stress induced by rcutorture.

Idle:

Total:66000, unnecessary:66000, case1:61827, case2:66000, case3:0
Total:68000, unnecessary:68000, case1:63696, case2:68000, case3:0

rcutorture:

Total:254000, unnecessary:254000, case1:199913, case2:254000, case3:0
Total:256000, unnecessary:256000, case1:201784, case2:256000, case3:0

Here case{1-3} are the cases listed above. We can avoid these wake
ups by using rcu_gp_kthread_wake() to conditionally wake up the grace
period kthreads.

There is a comment about an implied barrier supplied by the wake_up()
logic. This barrier is necessary for the awakened thread to see the
updated ->gp_flags. This flag is always being updated with the root node
lock held. Also, the awakened thread tries to acquire the root node lock
before reading ->gp_flags because of which there is proper ordering.

Hence this commit tries to avoid calling wake_up() whenever we can by
using rcu_gp_kthread_wake() function.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 66d701ea 16-Jul-2014 Pranith Kumar <bobby.prani@gmail.com>

rcu: Remove stale comment in tree.c

This commit removes a stale comment in rcu/tree.c which was left
out when some code was moved around previously in commit 2036d94a7b61
("rcu: Rework detection of use of RCU by offline CPUs") For reference,
the following updated comment exists a few lines below this which means
the same:

/* Remove the outgoing CPU from the masks in the rcu_node hierarchy. */

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# a8a29b3b 12-Jul-2014 Ard Biesheuvel <ardb@kernel.org>

rcu: Define tracepoint strings only if CONFIG_TRACING is set

Commit f7f7bac9cb1c ("rcu: Have the RCU tracepoints use the tracepoint_string
infrastructure") unconditionally populates the __tracepoint_str input section,
but this section is not assigned an output section if CONFIG_TRACING is not set.
This results in the __tracepoint_str turning up in unexpected places, i.e.,
after _edata.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# e02b2edf 08-Jul-2014 Pranith Kumar <bobby.prani@gmail.com>

rcu: Use true/false instead of 1/0 for a bool type

This commit uses true/false instead of 1/0 for bool types in rcu_gp_fqs()
and force_qs_rnp().

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# f534ed1f 08-Jul-2014 Pranith Kumar <bobby.prani@gmail.com>

rcu: Use bool type for return value in rcu_is_watching()

Use a bool type for return in rcu_is_watching().

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 4de376a1 08-Jul-2014 Pranith Kumar <bobby.prani@gmail.com>

rcu: Remove remaining read-modify-write ACCESS_ONCE() calls

Change the remaining uses of ACCESS_ONCE() so that each ACCESS_ONCE() either does a load or a store, but not both.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 11992c70 23-Jun-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove CONFIG_PROVE_RCU_DELAY

The CONFIG_PROVE_RCU_DELAY Kconfig parameter doesn't appear to be very
effective at finding race conditions, so this commit removes it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
[ paulmck: Remove definition and uses as noted by Paul Bolle. ]


# d860d403 19-Jun-2014 Shan Wei <davidshan@tencent.com>

rcu: Use __this_cpu_read() instead of per_cpu_ptr()

The __this_cpu_read() function produces better code than does
per_cpu_ptr() on both ARM and x86. For example, gcc (Ubuntu/Linaro
4.7.3-12ubuntu1) 4.7.3 produces the following:

ARMv7 per_cpu_ptr():

force_quiescent_state:
mov r3, sp @,
bic r1, r3, #8128 @ tmp171,,
ldr r2, .L98 @ tmp169,
bic r1, r1, #63 @ tmp170, tmp171,
ldr r3, [r0, #220] @ __ptr, rsp_6(D)->rda
ldr r1, [r1, #20] @ D.35903_68->cpu, D.35903_68->cpu
mov r6, r0 @ rsp, rsp
ldr r2, [r2, r1, asl #2] @ tmp173, __per_cpu_offset
add r3, r3, r2 @ tmp175, __ptr, tmp173
ldr r5, [r3, #12] @ rnp_old, D.29162_13->mynode

ARMv7 __this_cpu_read():

force_quiescent_state:
ldr r3, [r0, #220] @ rsp_7(D)->rda, rsp_7(D)->rda
mov r6, r0 @ rsp, rsp
add r3, r3, #12 @ __ptr, rsp_7(D)->rda,
ldr r5, [r2, r3] @ rnp_old, *D.29176_13

Using gcc 4.8.2:

x86_64 per_cpu_ptr():

movl %gs:cpu_number,%edx # cpu_number, pscr_ret__
movslq %edx, %rdx # pscr_ret__, pscr_ret__
movq __per_cpu_offset(,%rdx,8), %rdx # __per_cpu_offset, tmp93
movq %rdi, %r13 # rsp, rsp
movq 1000(%rdi), %rax # rsp_9(D)->rda, __ptr
movq 24(%rdx,%rax), %r12 # _15->mynode, rnp_old

x86_64 __this_cpu_read():

movq %rdi, %r13 # rsp, rsp
movq 1000(%rdi), %rax # rsp_9(D)->rda, rsp_9(D)->rda
movq %gs:24(%rax),%r12 # _10->mynode, rnp_old

Because this change produces significant benefits for these two very
diverse architectures, this commit makes this change.

Signed-off-by: Shan Wei <davidshan@tencent.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>


# bc1dce51 18-Jun-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Don't use NMIs to dump other CPUs' stacks

Although NMI-based stack dumps are in principle more accurate, they are
also more likely to trigger deadlocks. This commit therefore replaces
all uses of trigger_all_cpu_backtrace() with rcu_dump_cpu_stacks(), so
that the CPU detecting an RCU CPU stall does the stack dumping.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>


# 48bd8e9b 11-Jun-2014 Pranith Kumar <bobby.prani@gmail.com>

rcu: Check both root and current rcu_node when setting up future grace period

The rcu_start_future_gp() function checks the current rcu_node's ->gpnum
and ->completed twice, once without ACCESS_ONCE() and once with it.
Which is pointless because we hold that rcu_node's ->lock at that point.
The intent was to check the current rcu_node structure and the root
rcu_node structure, the latter locklessly with ACCESS_ONCE(). This
commit therefore makes that change.

The reason that it is safe to locklessly check the root rcu_nodes's
->gpnum and ->completed fields is that we hold the current rcu_node's
->lock, which constrains the root rcu_node's ability to change its
->gpnum and ->completed fields. Of course, if there is a single rcu_node
structure, then rnp_root==rnp, and holding the lock prevents all changes.
If there is more than one rcu_node structure, then the code updates the
fields in the following order:

1. Increment rnp_root->gpnum to start new grace period.
2. Increment rnp->gpnum to initialize the current rcu_node,
continuing initialization for the new grace period.
3. Increment rnp_root->completed to end the current grace period.
4. Increment rnp->completed to continue cleaning up after the
old grace period.

So there are four possible combinations of relative values of these
four fields:

N N N N: RCU idle, new grace period must be initiated.
Although rnp_root->gpnum might be incremented immediately
after we check, that will just result in unnecessary work.
The grace period already started, and we try to start it.

N+1 N N N: RCU grace period just started. No further change is
possible because we hold rnp->lock, so the checks of
rnp_root->gpnum and rnp_root->completed are stable.
We know that our request for a future grace period will
be seen during grace-period cleanup.

N+1 N N+1 N: RCU grace period is ongoing. Because rnp->gpnum is
different than rnp->completed, we won't even look at
rnp_root->gpnum and rnp_root->completed, so the possible
concurrent change to rnp_root->completed does not matter.
We know that our request for a future grace period will
be seen during grace-period cleanup, which cannot pass
this rcu_node because we hold its ->lock.

N+1 N+1 N+1 N: RCU grace period has ended, but not yet been cleaned up.
Because rnp->gpnum is different than rnp->completed, we
won't look at rnp_root->gpnum and rnp_root->completed, so
the possible concurrent change to rnp_root->completed does
not matter. We know that our request for a future grace
period will be seen during grace-period cleanup, which
cannot pass this rcu_node because we hold its ->lock.

Therefore, despite initial appearances, the lockless check is safe.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
[ paulmck: Update comment to say why the lockless check is safe. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 1146edcb 09-Jun-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Loosen __call_rcu()'s rcu_head alignment constraint

The m68k architecture aligns only to 16-bit boundaries, which can cause
the align-to-32-bits check in __call_rcu() to trigger. Because there is
currently no known potential need for more than one low-order bit, this
commit loosens the check to 16-bit boundaries.

Reported-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>


# a792563b 02-Jun-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Eliminate read-modify-write ACCESS_ONCE() calls

RCU contains code of the following forms:

ACCESS_ONCE(x)++;
ACCESS_ONCE(x) += y;
ACCESS_ONCE(x) -= y;

Now these constructs do operate correctly, but they really result in a
pair of volatile accesses, one to do the load and another to do the store.
This can be confusing, as the casual reader might well assume that (for
example) gcc might generate a memory-to-memory add instruction for each
of these three cases. In fact, gcc will do no such thing. Also, there
is a good chance that the kernel will move to separate load and store
variants of ACCESS_ONCE(), and constructs like the above could easily
confuse both people and scripts attempting to make that sort of change.
Finally, most of RCU's read-modify-write uses of ACCESS_ONCE() really
only need the store to be volatile, so that the read-modify-write form
might be misleading.

This commit therefore changes the above forms in RCU so that each instance
of ACCESS_ONCE() either does a load or a store, but not both. In a few
cases, ACCESS_ONCE() was not critical, for example, for maintaining
statisitics. In these cases, ACCESS_ONCE() has been dispensed with
entirely.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# b4426b49 06-May-2014 Fabian Frederick <fabf@skynet.be>

rcu: Make rcu node arrays static const char * const

Those two arrays are being passed to lockdep_init_map(), which expects
const char *, and are stored in lockdep_map the same way.

Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 4a81e832 20-Jun-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Reduce overhead of cond_resched() checks for RCU

Commit ac1bea85781e (Make cond_resched() report RCU quiescent states)
fixed a problem where a CPU looping in the kernel with but one runnable
task would give RCU CPU stall warnings, even if the in-kernel loop
contained cond_resched() calls. Unfortunately, in so doing, it introduced
performance regressions in Anton Blanchard's will-it-scale "open1" test.
The problem appears to be not so much the increased cond_resched() path
length as an increase in the rate at which grace periods complete, which
increased per-update grace-period overhead.

This commit takes a different approach to fixing this bug, mainly by
moving the RCU-visible quiescent state from cond_resched() to
rcu_note_context_switch(), and by further reducing the check to a
simple non-zero test of a single per-CPU variable. However, this
approach requires that the force-quiescent-state processing send
resched IPIs to the offending CPUs. These will be sent only once
the grace period has reached an age specified by the boot/sysfs
parameter rcutree.jiffies_till_sched_qs, or once the grace period
reaches an age halfway to the point at which RCU CPU stall warnings
will be emitted, whichever comes first.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
[ paulmck: Made rcu_momentary_dyntick_idle() as suggested by the
ktest build robot. Also fixed smp_mb() comment as noted by
Oleg Nesterov. ]

Merge with e552592e (Reduce overhead of cond_resched() checks for RCU)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# e534165b 23-Mar-2014 Uma Sharma <uma.sharma523@gmail.com>

rcu: Variable name changed in tree_plugin.h and used in tree.c

The variable and struct both having the name "rcu_state" confuses
sparse in some situations, so this commit changes the variable to
"rcu_state_p" in order to avoid this confusion. This also makes
things easier for human readers.

Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
[ paulmck: Changed the declaration and several additional uses. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# afea227f 12-Mar-2014 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Export RCU grace-period kthread wait state to rcutorture

This commit allows rcutorture to print additional state for the
RCU grace-period kthreads in cases where RCU seems reluctant to
start a new grace period.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# ad0dc7f9 19-Feb-2014 Paul E. McKenney <paulmck@kernel.org>

rcutorture: Add forward-progress checking for writer

The rcutorture output currently does not distinguish between stalls in
the RCU implementation and stalls in the rcu_torture_writer() kthreads.
This commit therefore adds some diagnostics to help distinguish between
these two conditions, at least for the non-SRCU implementations. (SRCU
does not provide evidence of update-side forward progress by design.)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# fa07a58f 15-Apr-2014 Christoph Lameter <cl@linux.com>

rcu: Replace __this_cpu_ptr() uses with raw_cpu_ptr()

__this_cpu_ptr is being phased out.

One special case is increment_cpu_stall_ticks().
A per cpu variable is incremented so use raw_cpu_inc().

Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 8c96ae1d 14-Apr-2014 Pranith Kumar <pranith@gatech.edu>

rcu: Remove duplicate resched_cpu() declaration

Signed-off-by: Pranith Kumar <pranith@gatech.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# a381d757 19-Mar-2014 Andreea-Cristina Bernat <bernat.ada@gmail.com>

rcu: Merge rcu_sched_force_quiescent_state() with rcu_force_quiescent_state()

This patch merges the function rcu_force_quiescent_state() with
rcu_sched_force_quiescent_state(), using the rcu_state pointer. Firstly,
the rcu_sched_force_quiescent_state() function is deleted from the file
kernel/rcu/tree.c. Also, the rcu_force_quiescent_state() function that was
calling force_quiescent_state with the argument rcu_preempt_state pointer
was deleted as well. The new function that combines the old ones uses
the rcu_state pointer and is located after rcu_batches_completed_bh()
in kernel/rcu/tree.c.

Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 495aa969 18-Mar-2014 Andreea-Cristina Bernat <bernat.ada@gmail.com>

rcu: Consolidate kfree_call_rcu() to use rcu_state pointer

kfree_call_rcu is defined two times. When defined under CONFIG_TREE_PREEMPT_RCU,
it uses rcu_preempt_state. Otherwise, it uses rcu_sched_state.
This patch uses the rcu_state_pointer to combine the two definitions into one.
The resulting function is placed after the closing of the preprocessor
conditional CONFIG_TREE_PREEMPT_RCU.

Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 595f3900 18-Mar-2014 Himangi Saraogi <himangi774@gmail.com>

rcu: Replace NR_CPUS with nr_cpu_ids

This patch replaces NR_CPUS with nr_cpu_ids as NR_CPUS should
consider cpumask_var_t.

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 7941dbde 17-Mar-2014 Andreea-Cristina Bernat <bernat.ada@gmail.com>

rcu: Add event tracing to dyntick_save_progress_counter().

This patch adds event tracing to dyntick_save_progress_counter() in the case
where it returns 1. I used the tracepoint string "dti" because this function
returns 1 in case the CPU is in dynticks idle mode.

Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 48a7639c 11-Mar-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Make callers awaken grace-period kthread

The rcu_start_gp_advanced() function currently uses irq_work_queue()
to defer wakeups of the RCU grace-period kthread. This deferring
is necessary to avoid RCU-scheduler deadlocks involving the rcu_node
structure's lock, meaning that RCU cannot call any of the scheduler's
wake-up functions while holding one of these locks.

Unfortunately, the second and subsequent calls to irq_work_queue() are
ignored, and the first call will be ignored (aside from queuing the work
item) if the scheduler-clock tick is turned off. This is OK for many
uses, especially those where irq_work_queue() is called from an interrupt
or softirq handler, because in those cases the scheduler-clock-tick state
will be re-evaluated, which will turn the scheduler-clock tick back on.
On the next tick, any deferred work will then be processed.

However, this strategy does not always work for RCU, which can be invoked
at process level from idle CPUs. In this case, the tick might never
be turned back on, indefinitely defering a grace-period start request.
Note that the RCU CPU stall detector cannot see this condition, because
there is no RCU grace period in progress. Therefore, we can (and do!)
see long tens-of-seconds stalls in grace-period handling. In theory,
we could see a full grace-period hang, but rcutorture testing to date
has seen only the tens-of-seconds stalls. Event tracing demonstrates
that irq_work_queue() is being called repeatedly to no effect during
these stalls: The "newreq" event appears repeatedly from a task that is
not one of the grace-period kthreads.

In theory, irq_work_queue() might be fixed to avoid this sort of issue,
but RCU's requirements are unusual and it is quite straightforward to pass
wake-up responsibility up through RCU's call chain, so that the wakeup
happens when the offending locks are released.

This commit therefore makes this change. The rcu_start_gp_advanced(),
rcu_start_future_gp(), rcu_accelerate_cbs(), rcu_advance_cbs(),
__note_gp_changes(), and rcu_start_gp() functions now return a boolean
which indicates when a wake-up is needed. A new rcu_gp_kthread_wake()
does the wakeup when it is necessary and safe to do so: No self-wakes,
no wake-ups if the ->gp_flags field indicates there is no need (as in
someone else did the wake-up before we got around to it), and no wake-ups
before the grace-period kthread has been created.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 4fc5b755 11-Mar-2014 Iulia Manda <iulia.manda21@gmail.com>

rcu: Protect uses of jiffies_stall field with ACCESS_ONCE()

Some of the uses of the rcu_state structure's ->jiffies_stall field
do not use ACCESS_ONCE(), despite there being unprotected accesses.
This commit therefore uses the ACCESS_ONCE() macro to protect this field.

Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 9b67122a 11-Mar-2014 Iulia Manda <iulia.manda21@gmail.com>

rcu: Remove unused rcu_data structure field

The ->preemptible field in rcu_data is only initialized in the function
rcu_init_percpu_data(), and never used. This commit therefore removes
this field.

Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 365187fb 10-Mar-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Update cpu_needs_another_gp() for futures from non-NOCB CPUs

In the old days, the only source of requests for future grace periods
was NOCB CPUs. This has changed: CPUs routinely post requests for
future grace periods in order to promote power efficiency and reduce
OS jitter with minimal impact on grace-period latency. This commit
therefore updates cpu_needs_another_gp() to invoke rcu_future_needs_gp()
instead of rcu_nocb_needs_gp(). The latter is no longer used, so is
now removed. This commit also adds tracing for the irq_work_queue()
wakeup case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 83ebe63e 06-Mar-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Print negatives for stall-warning counter wraparound

The print_other_cpu_stall() and print_cpu_stall() functions print
grace-period numbers using an unsigned format, which means that the number
one less than zero is a very large number. This commit therefore causes
these numbers to be printed with a signed format in order to improve
readability of the RCU CPU stall-warning output.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 91dc9542 18-Feb-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Protect ->gp_flags accesses with ACCESS_ONCE()

A number of ->gp_flags accesses don't have ACCESS_ONCE(), but all of
the can race against other loads or stores. This commit therefore
applies ACCESS_ONCE() to the unprotected ->gp_flags accesses.

Reported-by: Alexey Roytman <alexey.roytman@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 4e857c58 17-Mar-2014 Peter Zijlstra <peterz@infradead.org>

arch: Mass conversion of smp_mb__*()

Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 765a3f4f 14-Mar-2014 Paul E. McKenney <paulmck@kernel.org>

rcu: Provide grace-period piggybacking API

The following pattern is currently not well supported by RCU:

1. Make data element inaccessible to RCU readers.

2. Do work that probably lasts for more than one grace period.

3. Do something to make sure RCU readers in flight before #1 above
have completed.

Here are some things that could currently be done:

a. Do a synchronize_rcu() unconditionally at either #1 or #3 above.
This works, but imposes needless work and latency.

b. Post an RCU callback at #1 above that does a wakeup, then
wait for the wakeup at #3. This works well, but likely results
in an extra unneeded grace period. Open-coding this is also
a bit more semi-tricky code than would be good.

This commit therefore adds get_state_synchronize_rcu() and
cond_synchronize_rcu() APIs. Call get_state_synchronize_rcu() at #1
above and pass its return value to cond_synchronize_rcu() at #3 above.
This results in a call to synchronize_rcu() if no grace period has
elapsed between #1 and #3, but requires only a load, comparison, and
memory barrier if a full grace period did elapse.

Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>


# 5cb5c6e1 19-Feb-2014 Paul Gortmaker <paul.gortmaker@windriver.com>

rcu: Ensure kernel/rcu/rcu.h can be sourced/used stand-alone

The kbuild test bot uncovered an implicit dependence on the
trace header being present before rcu.h in ia64 allmodconfig
that looks like this:

In file included from kernel/ksysfs.c:22:0:
kernel/rcu/rcu.h: In function '__rcu_reclaim':
kernel/rcu/rcu.h:107:3: error: implicit declaration of function 'trace_rcu_invoke_kfree_callback' [-Werror=implicit-function-declaration]
kernel/rcu/rcu.h:112:3: error: implicit declaration of function 'trace_rcu_invoke_callback' [-Werror=implicit-function-declaration]
cc1: some warnings being treated as errors

Looking at other rcu.h users, we can find that they all
were sourcing the trace header in advance of rcu.h itself,
as seen in the context of this diff. There were also some
inconsistencies as to whether it was or wasn't sourced based
on the parent tracing Kconfig.

Rather than "fix" it at each use site, and have inconsistent
use based on whether "#ifdef CONFIG_RCU_TRACE" was used or not,
lets just source the trace header just once, in the actual consumer
of it, which is rcu.h itself. We include it unconditionally, as
build testing shows us that is a hard requirement for some files.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# ffa83fb5 17-Nov-2013 Paul E. McKenney <paulmck@kernel.org>

rcu: Optimize rcu_needs_cpu() for RCU_NOCB_CPU_ALL

If CONFIG_RCU_NOCB_CPU_ALL=y, then rcu_needs_cpu() will always
return false, however, the current version nevertheless checks
for RCU callbacks. This commit therefore creates a static inline
implementation of rcu_needs_cpu() that unconditionally returns false
when CONFIG_RCU_NOCB_CPU_ALL=y.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# cb1e78cf 04-Dec-2013 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove ACCESS_ONCE() from jiffies

Because jiffies is one of a very few variables marked "volatile", there
is no need to use ACCESS_ONCE() when accessing it. This commit therefore
removes the redundant ACCESS_ONCE() wrappers.

Reported by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 87de1cfd 03-Dec-2013 Paul E. McKenney <paulmck@kernel.org>

rcu: Stop tracking FSF's postal address

All of the RCU source files have the usual GPL header, which contains a
long-obsolete postal address for FSF. To avoid the need to track the
FSF office's movements, this commit substitutes the URL where GPL may
be found.

Reported-by: Greg KH <gregkh@linuxfoundation.org>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 3660c281 03-Dec-2013 Paul E. McKenney <paulmck@kernel.org>

rcu: Add ACCESS_ONCE() to ->n_force_qs_lh accesses

The ->n_force_qs_lh field is accessed without the benefit of any
synchronization, so this commit adds the needed ACCESS_ONCE() wrappers.
Yes, increments to ->n_force_qs_lh can be lost, but contention should
be low and the field is strictly statistical in nature, so this is not
a problem.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 6303b9c8 11-Dec-2013 Paul E. McKenney <paulmck@kernel.org>

rcu: Apply smp_mb__after_unlock_lock() to preserve grace periods

RCU must ensure that there is the equivalent of a full memory
barrier between any memory access preceding grace period and any
memory access following that same grace period, regardless of
which CPU(s) happen to execute the two memory accesses.
Therefore, downgrading UNLOCK+LOCK to no longer imply a full
memory barrier requires some adjustments to RCU.

This commit therefore adds smp_mb__after_unlock_lock()
invocations as needed after the RCU lock acquisitions that need
to be part of a full-memory-barrier UNLOCK+LOCK.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <linux-arch@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1386799151-2219-7-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# a096932f 08-Nov-2013 Paul E. McKenney <paulmck@kernel.org>

rcu: Don't activate RCU core on NO_HZ_FULL CPUs

Whenever a CPU receives a scheduling-clock interrupt, RCU checks to see
if the RCU core needs anything from this CPU. If so, RCU raises
RCU_SOFTIRQ to carry out any needed processing.

This approach has worked well historically, but it is undesirable on
NO_HZ_FULL CPUs. Such CPUs are expected to spend almost all of their time
in userspace, so that scheduling-clock interrupts can be disabled while
there is only one runnable task on the CPU in question. Unfortunately,
raising any softirq has the potential to wake up ksoftirqd, which would
provide the second runnable task on that CPU, preventing disabling of
scheduling-clock interrupts.

What is needed instead is for RCU to leave NO_HZ_FULL CPUs alone,
relying on the grace-period kthreads' quiescent-state forcing to
do any needed RCU work on behalf of those CPUs.

This commit therefore refrains from raising RCU_SOFTIRQ on any
NO_HZ_FULL CPUs during any grace periods that have been in effect
for less than one second. The one-second limit handles the case
where an inappropriate workload is running on a NO_HZ_FULL CPU
that features lots of scheduling-clock interrupts, but no idle
or userspace time.

Reported-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <bitbucket@online.de>
Toasted-by: Frederic Weisbecker <fweisbec@gmail.com>


# 04f34650 16-Oct-2013 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix CONFIG_RCU_FANOUT_EXACT for odd fanout/leaf values

Each element of the rcu_state structure's ->levelspread[] array
is intended to contain the per-level fanout, where the zero-th
element corresponds to the root of the rcu_node tree, and the last
element corresponds to the leaves. In the CONFIG_RCU_FANOUT_EXACT
case, this means that the last element should be filled in
from CONFIG_RCU_FANOUT_LEAF (or from the rcu_fanout_leaf boot
parameter, if provided) and that the remaining elements should
be filled in from CONFIG_RCU_FANOUT. Unfortunately, the current
code in rcu_init_levelspread() takes the opposite approach, placing
CONFIG_RCU_FANOUT_LEAF in the zero-th element and CONFIG_RCU_FANOUT in
the remaining elements.

For typical power-of-two values, this generates odd but functional
rcu_node trees. However, other values, for example CONFIG_RCU_FANOUT=3
and CONFIG_RCU_FANOUT_LEAF=2, generate trees that can leave some CPUs
out of the grace-period computation, resulting in too-short grace periods
and therefore a broken RCU implementation.

This commit therefore fixes rcu_init_levelspread() to set the last
->levelspread[] array element from CONFIG_RCU_FANOUT_LEAF and the
remaining elements from CONFIG_RCU_FANOUT, thus generating the
intended rcu_node trees.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# f6f7ee9a 10-Oct-2013 Fengguang Wu <fengguang.wu@intel.com>

rcu: Fix coccinelle warnings

This commit fixes the following coccinelle warning:

kernel/rcu/tree.c:712:9-10: WARNING: return of 0/1 in function
'rcu_lockdep_current_cpu_online' with return type bool

Return statements in functions returning bool should use
true/false instead of 1/0.
Generated by: coccinelle/misc/boolreturn.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 39479098 09-Oct-2013 Paul E. McKenney <paulmck@kernel.org>

rcu: Let the world know when RCU adjusts its geometry

Some RCU bugs have been specific to the layout of the rcu_node tree,
but RCU will silently adjust the tree at boot time if appropriate.
This obscures valuable debugging information, so print a message when
this happens.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 3a592405 04-Oct-2013 Paul E. McKenney <paulmck@kernel.org>

rcu: Allow task-level idle entry/exit nesting

The current task-level idle entry/exit code forces an entry/exit on
each call, regardless of the nesting level. This commit therefore
properly accounts for nesting.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>


# 96d3fd0d 04-Oct-2013 Paul E. McKenney <paulmck@kernel.org>

rcu: Break call_rcu() deadlock involving scheduler and perf

Dave Jones got the following lockdep splat:

> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.12.0-rc3+ #92 Not tainted
> -------------------------------------------------------
> trinity-child2/15191 is trying to acquire lock:
> (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
>
> but task is already holding lock:
> (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&ctx->lock){-.-...}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
> [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
> [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
> [<ffffffff81732052>] __schedule+0x1d2/0xa20
> [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
> [<ffffffff817352b6>] retint_kernel+0x26/0x30
> [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
> [<ffffffff813f0504>] pty_write+0x54/0x60
> [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
> [<ffffffff813e5838>] tty_write+0x158/0x2d0
> [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
> [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> -> #2 (&rq->lock){-.-.-.}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
> [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
> [<ffffffff81054336>] do_fork+0x126/0x460
> [<ffffffff81054696>] kernel_thread+0x26/0x30
> [<ffffffff8171ff93>] rest_init+0x23/0x140
> [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
> [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
> [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
>
> -> #1 (&p->pi_lock){-.-.-.}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
> [<ffffffff81097d62>] default_wake_function+0x12/0x20
> [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
> [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
> [<ffffffff8108ff59>] __wake_up+0x39/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff81111b8d>] call_rcu+0x1d/0x20
> [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
> [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
> [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
> [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
> [<ffffffff817200be>] kernel_init+0xe/0x190
> [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
>
> -> #0 (&rdp->nocb_wq){......}:
> [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff8108ff43>] __wake_up+0x23/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
> [<ffffffff81149abf>] put_ctx+0x4f/0x70
> [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
> [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
> [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
> [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> other info that might help us debug this:
>
> Chain exists of:
> &rdp->nocb_wq --> &rq->lock --> &ctx->lock
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&ctx->lock);
> lock(&rq->lock);
> lock(&ctx->lock);
> lock(&rdp->nocb_wq);
>
> *** DEADLOCK ***
>
> 1 lock held by trinity-child2/15191:
> #0: (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> stack backtrace:
> CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
> ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
> ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
> ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
> Call Trace:
> [<ffffffff8172a363>] dump_stack+0x4e/0x82
> [<ffffffff81726741>] print_circular_bug+0x200/0x20f
> [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
> [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
> [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
> [<ffffffff8108ff43>] __wake_up+0x23/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
> [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
> [<ffffffff81149abf>] put_ctx+0x4f/0x70
> [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
> [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
> [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
> [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
> [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2

The underlying problem is that perf is invoking call_rcu() with the
scheduler locks held, but in NOCB mode, call_rcu() will with high
probability invoke the scheduler -- which just might want to use its
locks. The reason that call_rcu() needs to invoke the scheduler is
to wake up the corresponding rcuo callback-offload kthread, which
does the job of starting up a grace period and invoking the callbacks
afterwards.

One solution (championed on a related problem by Lai Jiangshan) is to
simply defer the wakeup to some point where scheduler locks are no longer
held. Since we don't want to unnecessarily incur the cost of such
deferral, the task before us is threefold:

1. Determine when it is likely that a relevant scheduler lock is held.

2. Defer the wakeup in such cases.

3. Ensure that all deferred wakeups eventually happen, preferably
sooner rather than later.

We use irqs_disabled_flags() as a proxy for relevant scheduler locks
being held. This works because the relevant locks are always acquired
with interrupts disabled. We may defer more often than needed, but that
is at least safe.

The wakeup deferral is tracked via a new field in the per-CPU and
per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.

This flag is checked by the RCU core processing. The __rcu_pending()
function now checks this flag, which causes rcu_check_callbacks()
to initiate RCU core processing at each scheduling-clock interrupt
where this flag is set. Of course this is not sufficient because
scheduling-clock interrupts are often turned off (the things we used to
be able to count on!). So the flags are also checked on entry to any
state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
state and NO_HZ_FULL user-mode-execution state.

This approach should allow call_rcu() to be invoked regardless of what
locks you might be holding, the key word being "should".

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>


# 78e4bc34 24-Sep-2013 Paul E. McKenney <paulmck@kernel.org>

rcu: Fix and comment ordering around wait_event()

It is all too easy to forget that wait_event() does not necessarily
imply a full memory barrier. The case where it does not is where the
condition transitions to true just as wait_event() starts execution.
This is actually a feature: The standard use of wait_event() involves
locking, in which case the locks provide the needed ordering (you hold a
lock across the wake_up() and acquire that same lock after wait_event()
returns).

Given that I did forget that wait_event() does not necessarily imply a
full memory barrier in one case, this commit fixes that case. This commit
also adds comments calling out the placement of existing memory barriers
relied on by wait_event() calls.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 6193c76a 23-Sep-2013 Paul E. McKenney <paulmck@kernel.org>

rcu: Kick CPU halfway to RCU CPU stall warning

When an RCU CPU stall warning occurs, the CPU invokes resched_cpu() on
itself. This can help move the grace period forward in some situations,
but it would be even better to do this -before- the RCU CPU stall warning.
This commit therefore causes resched_cpu() to be called every five jiffies
once the system is halfway to an RCU CPU stall warning.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 4102adab 08-Oct-2013 Paul E. McKenney <paulmck@kernel.org>

rcu: Move RCU-related source code to kernel/rcu directory

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>