History log of /linux-master/include/linux/hardirq.h
Revision Date Author Comments
# c33ef43a 08-Jun-2022 Frederic Weisbecker <frederic@kernel.org>

rcu/context-tracking: Remove unused and/or unecessary middle functions

Some eqs functions are now only used internally by context tracking, so
their public declarations can be removed.

Also middle functions such as rcu_user_*() and rcu_idle_*()
which now directly call to rcu_eqs_enter() and rcu_eqs_exit() can be
wiped out as well.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>


# 493c1822 08-Jun-2022 Frederic Weisbecker <frederic@kernel.org>

context_tracking: Take NMI eqs entrypoints over RCU

The RCU dynticks counter is going to be merged into the context tracking
subsystem. Prepare with moving the NMI extended quiescent states
entrypoints to context tracking. For now those are dumb redirection to
existing RCU calls.

Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>


# 85e3e7fb 15-Jul-2021 John Ogness <john.ogness@linutronix.de>

printk: remove NMI tracking

All NMI contexts are handled the same as the safe context: store the
message and defer printing. There is no need to have special NMI
context tracking for this. Using in_nmi() is enough.

There are several parts of the kernel that are manually calling into
the printk NMI context tracking in order to cause general printk
deferred printing:

arch/arm/kernel/smp.c
arch/powerpc/kexec/crash.c
kernel/trace/trace.c

For arm/kernel/smp.c and powerpc/kexec/crash.c, provide a new
function pair printk_deferred_enter/exit that explicitly achieves the
same objective.

For ftrace, remove the printk context manipulation completely. It was
added in commit 03fc7f9c99c1 ("printk/nmi: Prevent deadlock when
accessing the main log buffer in NMI"). The purpose was to enforce
storing messages directly into the ring buffer even in NMI context.
It really should have only modified the behavior in NMI context.
There is no need for a special behavior any longer. All messages are
always stored directly now. The console deferring is handled
transparently in vprintk().

Signed-off-by: John Ogness <john.ogness@linutronix.de>
[pmladek@suse.com: Remove special handling in ftrace.c completely.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210715193359.25946-5-john.ogness@linutronix.de


# 728b478d 09-Mar-2021 Thomas Gleixner <tglx@linutronix.de>

softirq: Add RT specific softirq accounting

RT requires the softirq processing and local bottomhalf disabled regions to
be preemptible. Using the normal preempt count based serialization is
therefore not possible because this implicitely disables preemption.

RT kernels use a per CPU local lock to serialize bottomhalfs. As
local_bh_disable() can nest the lock can only be acquired on the outermost
invocation of local_bh_disable() and released when the nest count becomes
zero. Tasks which hold the local lock can be preempted so its required to
keep track of the nest count per task.

Add a RT only counter to task struct and adjust the relevant macros in
preempt.h.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210309085726.983627589@linutronix.de


# d3759e71 01-Dec-2020 Frederic Weisbecker <frederic@kernel.org>

irqtime: Move irqtime entry accounting after irq offset incrementation

IRQ time entry is currently accounted before HARDIRQ_OFFSET or
SOFTIRQ_OFFSET are incremented. This is convenient to decide to which
index the cputime to account is dispatched.

Unfortunately it prevents tick_irq_enter() from being called under
HARDIRQ_OFFSET because tick_irq_enter() has to be called before the IRQ
entry accounting due to the necessary clock catch up. As a result we
don't benefit from appropriate lockdep coverage on tick_irq_enter().

To prepare for fixing this, move the IRQ entry cputime accounting after
the preempt offset is incremented. This requires the cputime dispatch
code to handle the extra offset.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201202115732.27827-5-frederic@kernel.org


# ba1f2b2e 27-May-2020 Peter Zijlstra <peterz@infradead.org>

x86/entry: Fix NMI vs IRQ state tracking

While the nmi_enter() users did
trace_hardirqs_{off_prepare,on_finish}() there was no matching
lockdep_hardirqs_*() calls to complete the picture.

Introduce idtentry_{enter,exit}_nmi() to enable proper IRQ state
tracking across the NMIs.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200623083721.216740948@infradead.org


# 98a3bf19 21-May-2020 Thomas Gleixner <tglx@linutronix.de>

genirq: Provide __irq_enter/exit_raw()

Like __irq_enter/exit() but without time accounting. To be used for "empty"
system vectors like the scheduler IPI to avoid the overhead.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20200521202117.671682341@linutronix.de


# 8a6bc478 21-May-2020 Thomas Gleixner <tglx@linutronix.de>

genirq: Provide irq_enter/exit_rcu()

irq_enter()/exit() currently include RCU handling. To properly separate the RCU
handling code, provide variants which contain only the non-RCU related
functionality.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20200521202117.567023613@linutronix.de


# 2ab70319 21-May-2020 Thomas Gleixner <tglx@linutronix.de>

nmi, tracing: Make hardware latency tracing noinstr safe

The hardware latency tracer calls into instrumentable functions. Move the
calls into the RCU watching sections and annotate them.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20200521202116.904176298@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# aaf2bc50 21-May-2020 Paul E. McKenney <paulmck@kernel.org>

rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()

There will likely be exception handlers that can sleep, which rules
out the usual approach of invoking rcu_nmi_enter() on entry and also
rcu_nmi_exit() on all exit paths. However, the alternative approach of
just not calling anything can prevent RCU from coaxing quiescent states
from nohz_full CPUs that are looping in the kernel: RCU must instead
IPI them explicitly. It would be better to enable the scheduler tick
on such CPUs to interact with RCU in a lighter-weight manner, and this
enabling is one of the things that rcu_nmi_enter() currently does.

What is needed is something that helps RCU coax quiescent states while
not preventing subsequent sleeps. This commit therefore splits out the
nohz_full scheduler-tick enabling from the rest of the rcu_nmi_enter()
logic into a new function named rcu_irq_enter_check_tick().

[ tglx: Renamed the function and made it a nop when context tracking is off ]
[ mingo: Fixed a CONFIG_NO_HZ_FULL assumption, harmonized and fixed all the
comment blocks and cleaned up rcu_nmi_enter()/exit() definitions. ]

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200521202116.996113173@linutronix.de


# f93524eb 12-Feb-2020 Peter Zijlstra <peterz@infradead.org>

sched,rcu,tracing: Avoid tracing before in_nmi() is correct

If a tracer is invoked before in_nmi() becomes true, the tracer can no
longer detect it is called from NMI context and behave correctly.

Therefore change nmi_{enter,exit}() to use __preempt_count_{add,sub}()
as the normal preempt_count_{add,sub}() have a (desired) function
trace entry.

This fixes a potential issue with the current code; when the function-tracer
has stack-tracing enabled __trace_stack() will malfunction when it hits the
preempt_count_add() function entry from NMI context.

Suggested-by: Steven Rostedt (VMware) <rosted@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.434193525@linutronix.de


# 69ea03b5 19-Feb-2020 Peter Zijlstra <peterz@infradead.org>

hardirq/nmi: Allow nested nmi_enter()

Since there are already a number of sites (ARM64, PowerPC) that effectively
nest nmi_enter(), make the primitive support this before adding even more.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lkml.kernel.org/r/20200505134100.864179229@linutronix.de


# 2502ec37 19-Mar-2020 Thomas Gleixner <tglx@linutronix.de>

lockdep: Rename trace_hardirq_{enter,exit}()

Continue what commit:

d820ac4c2fa8 ("locking: rename trace_softirq_[enter|exit] => lockdep_softirq_[enter|exit]")

started, rename these to avoid confusing them with tracepoints.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20200320115859.060481361@infradead.org


# 5870970b 31-Jan-2019 Julien Thierry <julien.thierry.kdev@gmail.com>

arm64: Fix HCR.TGE status for NMI contexts

When using VHE, the host needs to clear HCR_EL2.TGE bit in order
to interact with guest TLBs, switching from EL2&0 translation regime
to EL1&0.

However, some non-maskable asynchronous event could happen while TGE is
cleared like SDEI. Because of this address translation operations
relying on EL2&0 translation regime could fail (tlb invalidation,
userspace access, ...).

Fix this by properly setting HCR_EL2.TGE when entering NMI context and
clear it if necessary when returning to the interrupted context.

Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Suggested-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: linux-arch@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 42a0bb3f 20-May-2016 Petr Mladek <pmladek@suse.com>

printk/nmi: generic solution for safe printk in NMI

printk() takes some locks and could not be used a safe way in NMI
context.

The chance of a deadlock is real especially when printing stacks from
all CPUs. This particular problem has been addressed on x86 by the
commit a9edc8809328 ("x86/nmi: Perform a safe NMI stack trace on all
CPUs").

The patchset brings two big advantages. First, it makes the NMI
backtraces safe on all architectures for free. Second, it makes all NMI
messages almost safe on all architectures (the temporary buffer is
limited. We still should keep the number of messages in NMI context at
minimum).

Note that there already are several messages printed in NMI context:
WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE
handlers. These are not easy to avoid.

This patch reuses most of the code and makes it generic. It is useful
for all messages and architectures that support NMI.

The alternative printk_func is set when entering and is reseted when
leaving NMI context. It queues IRQ work to copy the messages into the
main ring buffer in a safe context.

__printk_nmi_flush() copies all available messages and reset the buffer.
Then we could use a simple cmpxchg operations to get synchronized with
writers. There is also used a spinlock to get synchronized with other
flushers.

We do not longer use seq_buf because it depends on external lock. It
would be hard to make all supported operations safe for a lockless use.
It would be confusing and error prone to make only some operations safe.

The code is put into separate printk/nmi.c as suggested by Steven
Rostedt. It needs a per-CPU buffer and is compiled only on
architectures that call nmi_enter(). This is achieved by the new
HAVE_NMI Kconfig flag.

The are MN10300 and Xtensa architectures. We need to clean up NMI
handling there first. Let's do it separately.

The patch is heavily based on the draft from Peter Zijlstra, see

https://lkml.org/lkml/2015/6/10/327

[arnd@arndb.de: printk-nmi: use %zu format string for size_t]
[akpm@linux-foundation.org: min_t->min - all types are size_t here]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> [arm part]
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jiri Kosina <jkosina@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 92cf2118 12-May-2015 Frederic Weisbecker <fweisbec@gmail.com>

sched/preempt: Merge preempt_mask.h into preempt.h

preempt_mask.h defines all the preempt_count semantics and related
symbols: preempt, softirq, hardirq, nmi, preempt active, need resched,
etc...

preempt.h defines the accessors and mutators of preempt_count.

But there is a messy dependency game around those two header files:

* preempt_mask.h includes preempt.h in order to access preempt_count()

* preempt_mask.h defines all preempt_count semantic and symbols
except PREEMPT_NEED_RESCHED that is needed by asm/preempt.h
Thus we need to define it from preempt.h, right before including
asm/preempt.h, instead of defining it to preempt_mask.h with the
other preempt_count symbols. Therefore the preempt_count semantics
happen to be spread out.

* We plan to introduce preempt_active_[enter,exit]() to consolidate
preempt_schedule*() code. But we'll need to access both preempt_count
mutators (preempt_count_add()) and preempt_count symbols
(PREEMPT_ACTIVE, PREEMPT_OFFSET). The usual place to define preempt
operations is in preempt.h but then we'll need symbols in
preempt_mask.h which already includes preempt.h. So we end up with
a ressource circle dependency.

Lets merge preempt_mask.h into preempt.h to solve these dependency issues.
This way we gather semantic symbols and operation definition of
preempt_count in a single file.

This is a dumb copy-paste merge. Further merge re-arrangments are
performed in a subsequent patch to ease review.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431441711-29753-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 02cea395 05-Feb-2015 Peter Zijlstra <peterz@infradead.org>

genirq: Provide disable_hardirq()

For things like netpoll there is a need to disable an interrupt from
atomic context. Currently netpoll uses disable_irq() which will
sleep-wait on threaded handlers and thus forced_irqthreads breaks
things.

Provide disable_hardirq(), which uses synchronize_hardirq() to only wait
for active hardirq handlers; also change synchronize_hardirq() to
return the status of threaded handlers.

This will allow one to try-disable an interrupt from atomic context, or
in case of request_threaded_irq() to only wait for the hardirq part.

Suggested-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Miller <davem@davemloft.net>
Cc: Eyal Perry <eyalpe@mellanox.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Quentin Lambert <lambert.quentin@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Link: http://lkml.kernel.org/r/20150205130623.GH5029@twins.programming.kicks-ass.net
[ Fixed typos and such. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 18258f72 14-Feb-2014 Thomas Gleixner <tglx@linutronix.de>

genirq: Provide synchronize_hardirq()

synchronize_irq() waits for hard irq and threaded handlers to complete
before returning. For some special cases we only need to make sure
that the hard interrupt part of the irq line is not in progress when
we disabled the - possibly shared - interrupt at the device level.

A proper use case for this was provided by Russell. The sdhci driver
requires some irq triggered functions to be run in thread context. The
current implementation of the thread context is a sdio private kthread
construct, which has quite some shortcomings. These can be avoided
when the thread is directly associated to the device interrupt via the
generic threaded irq infrastructure.

Though there is a corner case related to run time power management
where one side disables the device interrupts at the device level and
needs to make sure, that an already running hard interrupt handler has
completed before proceeding further. Though that hard interrupt
handler might wake the associated thread, which in turn can request
the runtime PM to reenable the device. Using synchronize_irq() leads
to an immediate deadlock of the irq thread waiting for the PM lock and
the synchronize_irq() waiting for the irq thread to complete.

Due to the fact that it is sufficient for this case to ensure that no
hard irq handler is executing a new function which avoids the check
for the thread is required.

Add a function, which just monitors the hard irq parts and ignores the
threaded handlers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Russell King <linux@arm.linux.org.uk>
Cc: Chris Ball <chris@printf.net>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140215003823.653236081@linutronix.de


# 0bd3a173 19-Nov-2013 Peter Zijlstra <peterz@infradead.org>

sched/preempt, locking: Rework local_bh_{dis,en}able()

Currently local_bh_disable() is out-of-line for no apparent reason.
So inline it to save a few cycles on call/return nonsense, the
function body is a single add on x86 (a few loads and store extra on
load/store archs).

Also expose two new local_bh functions:

__local_bh_{dis,en}able_ip(unsigned long ip, unsigned int cnt);

Which implement the actual local_bh_{dis,en}able() behaviour.

The next patch uses the exposed @cnt argument to optimize bh lock
functions.

With build fixes from Jacob Pan.

Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# bdb43806 09-Sep-2013 Peter Zijlstra <peterz@infradead.org>

sched: Extract the basic add/sub preempt_count modifiers

Rewrite the preempt_count macros in order to extract the 3 basic
preempt_count value modifiers:

__preempt_count_add()
__preempt_count_sub()

and the new:

__preempt_count_dec_and_test()

And since we're at it anyway, replace the unconventional
$op_preempt_count names with the more conventional preempt_count_$op.

Since these basic operators are equivalent to the previous _notrace()
variants, do away with the _notrace() versions.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-ewbpdbupy9xpsjhg960zwbv8@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 0244ad00 30-Aug-2013 Martin Schwidefsky <schwidefsky@de.ibm.com>

Remove GENERIC_HARDIRQ config option

After the last architecture switched to generic hard irqs the config
options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
for !CONFIG_GENERIC_HARDIRQS can be removed.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 2d4b8473 29-Jul-2013 Frederic Weisbecker <fweisbec@gmail.com>

hardirq: Split preempt count mask definitions

In order to use static keys with vtime APIs, we'll need to
add static keys headers to vtime.h

hardirq.h then becomes a problem because it needs vtime.h
for irqtime accounting in irq_enter/irq_exit, but it's
often included just to get the irq mask definitions in the
task preempt_count field and the APIs that come along:
in_interrupt(), in_hardirq(), etc...

Some very low level arch headers sometimes need these masks
and APIs such as arch/m68k/include/asm/irqflags.h for example.
But they don't want to include hardirq.h if vtime.h, jump_label.h
and even workqueue.h come along. Including such bloated high
level header from arch headers can quickly result in circular
headers dependency that crash the build.

So let's split hardirq.h in two parts:

* preempt_mask.h that gathers all the preempt_count definitions
and the APIs associated. This one is considered low level and can
be safely included anywhere.

* hardirq.h that includes the previous one. It defines the irq
entry/exit APIs.

To avoid future circular headers dependencies, the preempt_mask.h
inclusion can replace hardirq.h on files that don't implement irq
low level handlers but just need the atomic/context check APIs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Kevin Hilman <khilman@linaro.org>


# 127781d1 27-Mar-2013 Paul E. McKenney <paulmck@kernel.org>

rcu: Remove TINY_PREEMPT_RCU

TINY_PREEMPT_RCU adds significant code and complexity, but does not
offer commensurate benefits. People currently using TINY_PREEMPT_RCU
can get much better memory footprint with TINY_RCU, or, if they really
need preemptible RCU, they can use TREE_PREEMPT_RCU with a relatively
minor degradation in memory footprint. Please note that this move
has been widely publicized on LKML (https://lkml.org/lkml/2012/11/12/545)
and on LWN (http://lwn.net/Articles/541037/).

This commit therefore removes TINY_PREEMPT_RCU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Updated to eliminate #else in rcutiny.h as suggested by Josh ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 4d4c4e24 21-Feb-2013 Frederic Weisbecker <fweisbec@gmail.com>

irq: Remove IRQ_EXIT_OFFSET workaround

The IRQ_EXIT_OFFSET trick was used to make sure the irq
doesn't get preempted after we substract the HARDIRQ_OFFSET
until we are entirely done with any code in irq_exit().

This workaround was necessary because some archs may call
irq_exit() with irqs enabled and there is still some code
in the end of this function that is not covered by the
HARDIRQ_OFFSET but want to stay non-preemptible.

Now that irq are always disabled in irq_exit(), the whole code
is guaranteed not to be preempted. We can thus remove this hack.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 6a61671b 16-Dec-2012 Frederic Weisbecker <fweisbec@gmail.com>

cputime: Safely read cputime of full dynticks CPUs

While remotely reading the cputime of a task running in a
full dynticks CPU, the values stored in utime/stime fields
of struct task_struct may be stale. Its values may be those
of the last kernel <-> user transition time snapshot and
we need to add the tickless time spent since this snapshot.

To fix this, flush the cputime of the dynticks CPUs on
kernel <-> user transition and record the time / context
where we did this. Then on top of this snapshot and the current
time, perform the fixup on the reader side from task_times()
accessors.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
[fixed kvm module related build errors]
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>


# 0f1ac8fd 15-Jan-2013 Steven Rostedt <srostedt@redhat.com>

tracing/lockdep: Disable lockdep first in entering NMI

When function tracing with either debug locks enabled or tracing
preempt disabled, the add_preempt_count() is traced. This is an
issue with lockdep and function tracing. As function tracing
can disable interrupts, and lockdep records that change,
lockdep may not be able to handle this recursion if it happens from
an NMI context.

The first thing that an NMI does is:

#define nmi_enter() \
do { \
ftrace_nmi_enter(); \
BUG_ON(in_nmi()); \
add_preempt_count(NMI_OFFSET + HARDIRQ_OFFSET); \
lockdep_off(); \
rcu_nmi_enter(); \
trace_hardirq_enter(); \
} while (0)

When the add_preempt_count() is traced, and the tracing callback
disables interrupts, it will jump into the lockdep code. There's
some places in lockdep that can't handle this re-entrance, and
causes lockdep to fail.

As the lockdep_off() (and lockdep_on) is a simple:

void lockdep_off(void)
{
current->lockdep_recursion++;
}

and is never traced, it can be called first in nmi_enter()
and lockdep_on() last in nmi_exit().

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# fa5058f3 05-Oct-2012 Frederic Weisbecker <fweisbec@gmail.com>

cputime: Specialize irq vtime hooks

With CONFIG_VIRT_CPU_ACCOUNTING, when vtime_account()
is called in irq entry/exit, we perform a check on the
context: if we are interrupting the idle task we
account the pending cputime to idle, otherwise account
to system time or its sub-areas: tsk->stime, hardirq time,
softirq time, ...

However this check for idle only concerns the hardirq entry
and softirq entry:

* Hardirq may directly interrupt the idle task, in which case
we need to flush the pending CPU time to idle.

* The idle task may be directly interrupted by a softirq if
it calls local_bh_enable(). There is probably no such call
in any idle task but we need to cover every case. Ksoftirqd
is not concerned because the idle time is flushed on context
switch and softirq in the end of hardirq have the idle time
already flushed from the hardirq entry.

In the other cases we always account to system/irq time:

* On hardirq exit we account the time to hardirq time.
* On softirq exit we account the time to softirq time.

To optimize this and avoid the indirect call to vtime_account()
and the checks it performs, specialize the vtime irq APIs and
only perform the check on irq entry. Irq exit can directly call
vtime_account_system().

CONFIG_IRQ_TIME_ACCOUNTING behaviour doesn't change and directly
maps to its own vtime_account() implementation. One may want
to take benefits from the new APIs to optimize irq time accounting
as well in the future.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>


# dcbf832e 05-Oct-2012 Frederic Weisbecker <fweisbec@gmail.com>

vtime: Gather vtime declarations to their own header file

These APIs are scattered around and are going to expand a bit.
Let's create a dedicated header file for sanity.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>


# bf9fae9f 08-Sep-2012 Frederic Weisbecker <fweisbec@gmail.com>

cputime: Use a proper subsystem naming for vtime related APIs

Use a naming based on vtime as a prefix for virtual based
cputime accounting APIs:

- account_system_vtime() -> vtime_account()
- account_switch_vtime() -> vtime_task_switch()

It makes it easier to allow for further declension such
as vtime_account_system(), vtime_account_idle(), ... if we
want to find out the context we account to from generic code.

This also make it better to know on which subsystem these APIs
refer to.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>


# a7e4786b 20-Jul-2012 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

sched: Fix comment about PREEMPT_ACTIVE bit location

PREEMPT_ACTIVE flag is bit 27, not 28. Fix the comment.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: paulmck@linux.vnet.ibm.com
Cc: josh@joshtriplett.org
Cc: rostedt@goodmis.org
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120720192459.6149.14821.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 9b2e4f18 30-Sep-2011 Paul E. McKenney <paulmck@kernel.org>

rcu: Track idleness independent of idle tasks

Earlier versions of RCU used the scheduling-clock tick to detect idleness
by checking for the idle task, but handled idleness differently for
CONFIG_NO_HZ=y. But there are now a number of uses of RCU read-side
critical sections in the idle task, for example, for tracing. A more
fine-grained detection of idleness is therefore required.

This commit presses the old dyntick-idle code into full-time service,
so that rcu_idle_enter(), previously known as rcu_enter_nohz(), is
always invoked at the beginning of an idle loop iteration. Similarly,
rcu_idle_exit(), previously known as rcu_exit_nohz(), is always invoked
at the end of an idle-loop iteration. This allows the idle task to
use RCU everywhere except between consecutive rcu_idle_enter() and
rcu_idle_exit() calls, in turn allowing architecture maintainers to
specify exactly where in the idle loop that RCU may be used.

Because some of the userspace upcall uses can result in what looks
to RCU like half of an interrupt, it is not possible to expect that
the irq_enter() and irq_exit() hooks will give exact counts. This
patch therefore expands the ->dynticks_nesting counter to 64 bits
and uses two separate bitfields to count process/idle transitions
and interrupt entry/exit transitions. It is presumed that userspace
upcalls do not happen in the idle loop or from usermode execution
(though usermode might do a system call that results in an upcall).
The counter is hard-reset on each process/idle transition, which
avoids the interrupt entry/exit error from accumulating. Overflow
is avoided by the 64-bitness of the ->dyntick_nesting counter.

This commit also adds warnings if a non-idle task asks RCU to enter
idle state (and these checks will need some adjustment before applying
Frederic's OS-jitter patches (http://lkml.org/lkml/2011/10/7/246).
In addition, validation of ->dynticks and ->dynticks_nesting is added.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# bdd4e85d 07-Jun-2011 Frederic Weisbecker <fweisbec@gmail.com>

sched: Isolate preempt counting in its own config option

Create a new CONFIG_PREEMPT_COUNT that handles the inc/dec
of preempt count offset independently. So that the offset
can be updated by preempt_disable() and preempt_enable()
even without the need for CONFIG_PREEMPT beeing set.

This prepares to make CONFIG_DEBUG_SPINLOCK_SLEEP working
with !CONFIG_PREEMPT where it currently doesn't detect
code that sleeps inside explicit preemption disabled
sections.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>


# 4ba8216c 25-Jan-2011 Arnd Bergmann <arnd@arndb.de>

BKL: That's all, folks

This removes the implementation of the big kernel lock,
at last. A lot of people have worked on this in the
past, I so the credit for this patch should be with
everyone who participated in the hunt.

The names on the Cc list are the people that were the
most active in this, according to the recorded git
history, in alphabetical order.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Alan Cox <alan@linux.intel.com>
Cc: Alessio Igor Bogani <abogani@texware.it>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Hendry <andrew.hendry@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jan Blunck <jblunck@infradead.org>
Cc: John Kacur <jkacur@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Oliver Neukum <oliver@neukum.org>
Cc: Paul Menage <menage@google.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>


# ed1d77b1 18-Nov-2010 Linus Torvalds <torvalds@linux-foundation.org>

hardirq.h: needs sched.h if using BKL

This really isn't the right thing to do, and strictly speaking we should
have the BKL depth count in the thread info right next to the preempt
count. The two really do go together.

However, since that would involve a patch to all architectures, and the
BKL is finally going away, it's simply not worth the effort to do the
RightThing(tm). Just re-instate the <linux/sched.h> include that we
used to get accidentally from the smp_lock.h one.

This is all fallout from the same old "BKL: remove extraneous #include
<smp_lock.h>" commit.

Reported-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a5b871e 17-Nov-2010 Linus Torvalds <torvalds@linux-foundation.org>

hardirq.h: remove now-empty #ifdef/#endif pair

Commit 451a3c24b013 ("BKL: remove extraneous #include <smp_lock.h>")
removed the #include line that was the only thing that was surrounded by
the #ifdef/#endif.

So now that #ifdef is guarding nothing at all. Just remove it.

Reported-by: Byeong-ryeol Kim <brofkims@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7957f0a8 17-Nov-2010 Linus Torvalds <torvalds@linux-foundation.org>

Fix build failure due to hwirq.h needing smp_lock.h

Arnd Bergmann did an automated scripting run to find left-over instances
of <linux/smp_lock.h>, and had made it trigger it on the normal BKL use
of lock_kernel and unlock_lernel (and apparently release_kernel_lock and
reacquire_kernel_lock too, used by the scheduler).

That resulted in commit 451a3c24b013 ("BKL: remove extraneous #include
<smp_lock.h>").

However, hardirq.h was the only remaining user of the old
'kernel_locked()' interface, and Arnd's script hadn't checked for that.
So depending on your configuration and what header files had been
included, you would get errors like "implicit declaration of function
'kernel_locked'" during the build.

The right fix is not to just re-instate the smp_lock.h include - it is
to just remove 'kernel_locked()' entirely, since the only use was this
one special low-level detail. Just make hardirq.h do it directly.

In fact this simplifies and clarifies the code, because some trivial
analysis makes it clear that hardirq.h only ever used _one_ of the two
definitions of kernel_locked(), so we can remove the other one entirely.

Reported-by: Zimny Lech <napohybelskurwysynom2010@gmail.com>
Reported-and-acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 451a3c24 17-Nov-2010 Arnd Bergmann <arnd@arndb.de>

BKL: remove extraneous #include <smp_lock.h>

The big kernel lock has been removed from all these files at some point,
leaving only the #include.

Remove this too as a cleanup.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7fe19da4 28-Oct-2010 Arnd Bergmann <arnd@arndb.de>

preempt: fix kernel build with !CONFIG_BKL

The preempt count logic tries to take the BKL into account, which breaks
when CONFIG_BKL is not set.

Use the same preempt_count offset that we use without CONFIG_PREEMPT
when CONFIG_BKL is disabled.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reported-and-tested-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b52bfee4 04-Oct-2010 Venkatesh Pallipadi <venki@google.com>

sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time

s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does
the fine granularity accounting of user, system, hardirq, softirq times.
Adding that option on archs like x86 will be challenging however, given the
state of TSC reliability on various platforms and also the overhead it will
add in syscall entry exit.

Instead, add a lighter variant that only does finer accounting of
hardirq and softirq times, providing precise irq times (instead of timer tick
based samples). This accounting is added with a new config option
CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
interested in paying the perf penalty.

This accounting is based on sched_clock, with the code being generic.
So, other archs may find it useful as well.

This patch just adds the core logic and does not enable this logic yet.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# e1e10a26 04-Oct-2010 Venkatesh Pallipadi <venki@google.com>

sched: Consolidate account_system_vtime extern declaration

Just a minor cleanup patch that makes things easier to the following patches.
No functionality change in this patch.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 75e1056f 04-Oct-2010 Venkatesh Pallipadi <venki@google.com>

sched: Fix softirq time accounting

Peter Zijlstra found a bug in the way softirq time is accounted in
VIRT_CPU_ACCOUNTING on this thread:

http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html

The problem is, softirq processing uses local_bh_disable internally. There
is no way, later in the flow, to differentiate between whether softirq is
being processed or is it just that bh has been disabled. So, a hardirq when bh
is disabled results in time being wrongly accounted as softirq.

Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
as well. As account_system_time() in normal tick based accouting also uses
softirq_count, which will be set even when not in softirq with bh disabled.

Peter also suggested solution of using 2*SOFTIRQ_OFFSET as irq count
for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq
processing. The patch below does that and adds API in_serving_softirq() which
returns whether we are currently processing softirq or not.

Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
to in_serving_softirq.

Looks like many usages of in_softirq really want in_serving_softirq. Those
changes can be made individually on a case by case basis.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# bcdb714c 07-Oct-2010 David Howells <dhowells@redhat.com>

Drop a couple of unnecessary asm/system.h inclusions

Drop inclusions of asm/system.h from linux/hardirq.h and linux/list.h as
they're no longer required and prevent the M68K arch's IRQ flag handling macros
from being made into inlined functions due to circular dependencies.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>


# a57eb940 29-Jun-2010 Paul E. McKenney <paulmck@kernel.org>

rcu: Add a TINY_PREEMPT_RCU

Implement a small-memory-footprint uniprocessor-only implementation of
preemptible RCU. This implementation uses but a single blocked-tasks
list rather than the combinatorial number used per leaf rcu_node by
TREE_PREEMPT_RCU, which reduces memory consumption and greatly simplifies
processing. This version also takes advantage of uniprocessor execution
to accelerate grace periods in the case where there are no readers.

The general design is otherwise broadly similar to that of TREE_PREEMPT_RCU.

This implementation is a step towards having RCU implementation driven
off of the SMP and PREEMPT kernel configuration variables, which can
happen once this implementation has accumulated sufficient experience.

Removed ACCESS_ONCE() from __rcu_read_unlock() and added barrier() as
suggested by Steve Rostedt in order to avoid the compiler-reordering
issue noted by Mathieu Desnoyers (http://lkml.org/lkml/2010/8/16/183).

As can be seen below, CONFIG_TINY_PREEMPT_RCU represents almost 5Kbyte
savings compared to CONFIG_TREE_PREEMPT_RCU. Of course, for non-real-time
workloads, CONFIG_TINY_RCU is even better.

CONFIG_TREE_PREEMPT_RCU

text data bss dec filename
13 0 0 13 kernel/rcupdate.o
6170 825 28 7023 kernel/rcutree.o
----
7026 Total

CONFIG_TINY_PREEMPT_RCU

text data bss dec filename
13 0 0 13 kernel/rcupdate.o
2081 81 8 2170 kernel/rcutiny.o
----
2183 Total

CONFIG_TINY_RCU (non-preemptible)

text data bss dec filename
13 0 0 13 kernel/rcupdate.o
719 25 0 744 kernel/rcutiny.o
---
757 Total

Requested-by: Loïc Minier <loic.minier@canonical.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


# 9b1d82fa 25-Oct-2009 Paul E. McKenney <paulmck@kernel.org>

rcu: "Tiny RCU", The Bloatwatch Edition

This patch is a version of RCU designed for !SMP provided for a
small-footprint RCU implementation. In particular, the
implementation of synchronize_rcu() is extremely lightweight and
high performance. It passes rcutorture testing in each of the
four relevant configurations (combinations of NO_HZ and PREEMPT)
on x86. This saves about 1K bytes compared to old Classic RCU
(which is no longer in mainline), and more than three kilobytes
compared to Hierarchical RCU (updated to 2.6.30):

CONFIG_TREE_RCU:

text data bss dec filename
183 4 0 187 kernel/rcupdate.o
2783 520 36 3339 kernel/rcutree.o
3526 Total (vs 4565 for v7)

CONFIG_TREE_PREEMPT_RCU:

text data bss dec filename
263 4 0 267 kernel/rcupdate.o
4594 776 52 5422 kernel/rcutree.o
5689 Total (6155 for v7)

CONFIG_TINY_RCU:

text data bss dec filename
96 4 0 100 kernel/rcupdate.o
734 24 0 758 kernel/rcutiny.o
858 Total (vs 848 for v7)

The above is for x86. Your mileage may vary on other platforms.
Further compression is possible, but is being procrastinated.

Changes from v7 (http://lkml.org/lkml/2009/10/9/388)

o Apply Lai Jiangshan's review comments (aside from
might_sleep() in synchronize_sched(), which is covered by SMP builds).

o Fix up expedited primitives.

Changes from v6 (http://lkml.org/lkml/2009/9/23/293).

o Forward ported to put it into the 2.6.33 stream.

o Added lockdep support.

o Make lightweight rcu_barrier.

Changes from v5 (http://lkml.org/lkml/2009/6/23/12).

o Ported to latest pre-2.6.32 merge window kernel.

- Renamed rcu_qsctr_inc() to rcu_sched_qs().
- Renamed rcu_bh_qsctr_inc() to rcu_bh_qs().
- Provided trivial rcu_cpu_notify().
- Provided trivial exit_rcu().
- Provided trivial rcu_needs_cpu().
- Fixed up the rcu_*_enter/exit() functions in linux/hardirq.h.

o Removed the dependence on EMBEDDED, with a view to making
TINY_RCU default for !SMP at some time in the future.

o Added (trivial) support for expedited grace periods.

Changes from v4 (http://lkml.org/lkml/2009/5/2/91) include:

o Squeeze the size down a bit further by removing the
->completed field from struct rcu_ctrlblk.

o This permits synchronize_rcu() to become the empty function.
Previous concerns about rcutorture were unfounded, as
rcutorture correctly handles a constant value from
rcu_batches_completed() and rcu_batches_completed_bh().

Changes from v3 (http://lkml.org/lkml/2009/3/29/221) include:

o Changed rcu_batches_completed(), rcu_batches_completed_bh()
rcu_enter_nohz(), rcu_exit_nohz(), rcu_nmi_enter(), and
rcu_nmi_exit(), to be static inlines, as suggested by David
Howells. Doing this saves about 100 bytes from rcutiny.o.
(The numbers between v3 and this v4 of the patch are not directly
comparable, since they are against different versions of Linux.)

Changes from v2 (http://lkml.org/lkml/2009/2/3/333) include:

o Fix whitespace issues.

o Change short-circuit "||" operator to instead be "+" in order
to fix performance bug noted by "kraai" on LWN.

(http://lwn.net/Articles/324348/)

Changes from v1 (http://lkml.org/lkml/2009/1/13/440) include:

o This version depends on EMBEDDED as well as !SMP, as suggested
by Ingo.

o Updated rcu_needs_cpu() to unconditionally return zero,
permitting the CPU to enter dynticks-idle mode at any time.
This works because callbacks can be invoked upon entry to
dynticks-idle mode.

o Paul is now OK with this being included, based on a poll at
the Kernel Miniconf at linux.conf.au, where about ten people said
that they cared about saving 900 bytes on single-CPU systems.

o Applies to both mainline and tip/core/rcu.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: avi@redhat.com
Cc: mtosatti@redhat.com
LKML-Reference: <12565226351355-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# b560d8ad 21-Aug-2009 Paul E. McKenney <paulmck@kernel.org>

rcu: Expunge lingering references to CONFIG_CLASSIC_RCU, optimize on !SMP

A couple of references to CONFIG_CLASSIC_RCU have survived.
Although these are harmless, it is past time for them to go.
The one in hardirq.h is strictly a readability problem.

The two in pagemap.h appear to disable a !SMP performance
optimization (which this patch re-enables).

This does raise the issue as to whether pagemap.h should really
be referring to the CPU implementation. Long term, I intend to
make the RCU implementation driven by CONFIG_PREEMPT, at which
point these should change from defined(CONFIG_TREE_RCU) to
!defined(CONFIG_PREEMPT). In the meantime, is there something
else that could be done in pagemap.h?

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <20090822050851.GA8414@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 8e5b59a2 06-Aug-2009 Arnd Bergmann <arnd@arndb.de>

sched: Add default defines for PREEMPT_ACTIVE

The PREEMPT_ACTIVE setting doesn't actually need to be
arch-specific, so set up a sane default for all arches to
(hopefully) migrate to.

> if we look at linux/hardirq.h, it makes this claim:
> * - bit 28 is the PREEMPT_ACTIVE flag
> if that's true, then why are we letting any arch set this define ? a
> quick survey shows that half the arches (11) are using 0x10000000 (bit
> 28) while the other half (10) are using 0x4000000 (bit 26). and then
> there is the ia64 oddity which uses bit 30. the exact value here
> shouldnt really matter across arches though should it ?

actually alpha, arm and avr32 also use bit 30 (0x40000000),
there are only five (or eight, depending on how you count)
architectures (blackfin, h8300, m68k, s390 and sparc) using bit
26.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 405f5571 11-Jul-2009 Alexey Dobriyan <adobriyan@gmail.com>

headers: smp_lock.h redux

* Remove smp_lock.h from files which don't need it (including some headers!)
* Add smp_lock.h to files which do need it
* Make smp_lock.h include conditional in hardirq.h
It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT

This will make hardirq.h inclusion cheaper for every PREEMPT=n config
(which includes allmodconfig/allyesconfig, BTW)

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3aa551c9 23-Mar-2009 Thomas Gleixner <tglx@linutronix.de>

genirq: add threaded interrupt handler support

Add support for threaded interrupt handlers:

A device driver can request that its main interrupt handler runs in a
thread. To achive this the device driver requests the interrupt with
request_threaded_irq() and provides additionally to the handler a
thread function. The handler function is called in hard interrupt
context and needs to check whether the interrupt originated from the
device. If the interrupt originated from the device then the handler
can either return IRQ_HANDLED or IRQ_WAKE_THREAD. IRQ_HANDLED is
returned when no further action is required. IRQ_WAKE_THREAD causes
the genirq code to invoke the threaded (main) handler. When
IRQ_WAKE_THREAD is returned handler must have disabled the interrupt
on the device level. This is mandatory for shared interrupt handlers,
but we need to do it as well for obscure x86 hardware where disabling
an interrupt on the IO_APIC level redirects the interrupt to the
legacy PIC interrupt lines.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>


# 2a7b8df0 12-Feb-2009 Steven Rostedt <srostedt@redhat.com>

sched: do not account for NMIs

Impact: avoid corruption in system time accounting

Martin Schwidefsky told me that there was an issue with NMIs and
system accounting. The problem is that the accounting code is
not reentrant, and if an NMI goes off after an interrupt it can
corrupt the accounting.

For now, the best we can do is to treat NMIs like SMIs and they
are not accounted for.

This patch changes nmi_enter to not call __irq_enter and to do
the preempt-count and tracing calls directly.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>


# 5a5fb7db 12-Feb-2009 Steven Rostedt <srostedt@redhat.com>

preempt-count: force hardirq-count to max of 10

To add a bit in the preempt_count to be set when in NMI context, we
found that some archs did not have enough bits to spare. This is
due to the hardirq_count being a mask that can hold NR_IRQS.

Some archs allow for over 16000 IRQs, and that would require a mask
of 14 bits. The sofitrq mask is 8 bits and the preempt disable mask
is also 8 bits. The PREEMP_ACTIVE bit is bit 30, and bit 31 would
make the preempt_count (which is type int) a negative number.
A negative preempt_count is a sign of failure.

Add them up 14+8+8+1+1 you get 32 bits. No room for the NMI bit.

But the hardirq_count is to track the number of nested IRQs, not
the number of total IRQs. This originally took the paranoid approach
of setting the max nesting to NR_IRQS. But when we have archs with
over 1000 IRQs, it is not practical to think they will ever all
nest on a single CPU. Not to mention that this would most definitely
cause a stack overflow.

This patch sets a max of 10 bits to be used for IRQ nesting.
I did a 'git grep HARDIRQ' to examine all users of HARDIRQ_BITS and
HARDIRQ_MASK, and found that making it a max of 10 would not hurt
anyone. I did find that the m68k expected it to be 8 bits, so
I allow for the archs to set the number to be less than 10.

I removed the setting of HARDIRQ_BITS from the archs that set it
to more than 10. This includes ALPHA, ia64 and avr32.

This will always allow room for the NMI bit, and if we need to allow
for NMI nesting, we have 4 bits to play with.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>


# 375b38b4 05-Feb-2009 Steven Rostedt <srostedt@redhat.com>

nmi: add generic nmi tracking state

This code adds an in_nmi() macro that uses the current tasks preempt count
to track when it is in NMI context. Other parts of the kernel can
use this to determine if the context is in NMI context or not.

This code was inspired by the -rt patch in_nmi version that was
written by Peter Zijlstra, who borrowed that code from
Mathieu Desnoyers.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>


# 64db4cff 18-Dec-2008 Paul E. McKenney <paulmck@kernel.org>

"Tree RCU": scalable classic RCU implementation

This patch fixes a long-standing performance bug in classic RCU that
results in massive internal-to-RCU lock contention on systems with
more than a few hundred CPUs. Although this patch creates a separate
flavor of RCU for ease of review and patch maintenance, it is intended
to replace classic RCU.

This patch still handles stress better than does mainline, so I am still
calling it ready for inclusion. This patch is against the -tip tree.
Nevertheless, experience on an actual 1000+ CPU machine would still be
most welcome.

Most of the changes noted below were found while creating an rcutiny
(which should permit ejecting the current rcuclassic) and while doing
detailed line-by-line documentation.

Updates from v9 (http://lkml.org/lkml/2008/12/2/334):

o Fixes from remainder of line-by-line code walkthrough,
including comment spelling, initialization, undesirable
narrowing due to type conversion, removing redundant memory
barriers, removing redundant local-variable initialization,
and removing redundant local variables.

I do not believe that any of these fixes address the CPU-hotplug
issues that Andi Kleen was seeing, but please do give it a whirl
in case the machine is smarter than I am.

A writeup from the walkthrough may be found at the following
URL, in case you are suffering from terminal insomnia or
masochism:

http://www.kernel.org/pub/linux/kernel/people/paulmck/tmp/rcutree-walkthrough.2008.12.16a.pdf

o Made rcutree tracing use seq_file, as suggested some time
ago by Lai Jiangshan.

o Added a .csv variant of the rcudata debugfs trace file, to allow
people having thousands of CPUs to drop the data into
a spreadsheet. Tested with oocalc and gnumeric. Updated
documentation to suit.

Updates from v8 (http://lkml.org/lkml/2008/11/15/139):

o Fix a theoretical race between grace-period initialization and
force_quiescent_state() that could occur if more than three
jiffies were required to carry out the grace-period
initialization. Which it might, if you had enough CPUs.

o Apply Ingo's printk-standardization patch.

o Substitute local variables for repeated accesses to global
variables.

o Fix comment misspellings and redundant (but harmless) increments
of ->n_rcu_pending (this latter after having explicitly added it).

o Apply checkpatch fixes.

Updates from v7 (http://lkml.org/lkml/2008/10/10/291):

o Fixed a number of problems noted by Gautham Shenoy, including
the cpu-stall-detection bug that he was having difficulty
convincing me was real. ;-)

o Changed cpu-stall detection to wait for ten seconds rather than
three in order to reduce false positive, as suggested by Ingo
Molnar.

o Produced a design document (http://lwn.net/Articles/305782/).
The act of writing this document uncovered a number of both
theoretical and "here and now" bugs as noted below.

o Fix dynticks_nesting accounting confusion, simplify WARN_ON()
condition, fix kerneldoc comments, and add memory barriers
in dynticks interface functions.

o Add more data to tracing.

o Remove unused "rcu_barrier" field from rcu_data structure.

o Count calls to rcu_pending() from scheduling-clock interrupt
to use as a surrogate timebase should jiffies stop counting.

o Fix a theoretical race between force_quiescent_state() and
grace-period initialization. Yes, initialization does have to
go on for some jiffies for this race to occur, but given enough
CPUs...

Updates from v6 (http://lkml.org/lkml/2008/9/23/448):

o Fix a number of checkpatch.pl complaints.

o Apply review comments from Ingo Molnar and Lai Jiangshan
on the stall-detection code.

o Fix several bugs in !CONFIG_SMP builds.

o Fix a misspelled config-parameter name so that RCU now announces
at boot time if stall detection is configured.

o Run tests on numerous combinations of configurations parameters,
which after the fixes above, now build and run correctly.

Updates from v5 (http://lkml.org/lkml/2008/9/15/92, bad subject line):

o Fix a compiler error in the !CONFIG_FANOUT_EXACT case (blew a
changeset some time ago, and finally got around to retesting
this option).

o Fix some tracing bugs in rcupreempt that caused incorrect
totals to be printed.

o I now test with a more brutal random-selection online/offline
script (attached). Probably more brutal than it needs to be
on the people reading it as well, but so it goes.

o A number of optimizations and usability improvements:

o Make rcu_pending() ignore the grace-period timeout when
there is no grace period in progress.

o Make force_quiescent_state() avoid going for a global
lock in the case where there is no grace period in
progress.

o Rearrange struct fields to improve struct layout.

o Make call_rcu() initiate a grace period if RCU was
idle, rather than waiting for the next scheduling
clock interrupt.

o Invoke rcu_irq_enter() and rcu_irq_exit() only when
idle, as suggested by Andi Kleen. I still don't
completely trust this change, and might back it out.

o Make CONFIG_RCU_TRACE be the single config variable
manipulated for all forms of RCU, instead of the prior
confusion.

o Document tracing files and formats for both rcupreempt
and rcutree.

Updates from v4 for those missing v5 given its bad subject line:

o Separated dynticks interface so that NMIs and irqs call separate
functions, greatly simplifying it. In particular, this code
no longer requires a proof of correctness. ;-)

o Separated dynticks state out into its own per-CPU structure,
avoiding the duplicated accounting.

o The case where a dynticks-idle CPU runs an irq handler that
invokes call_rcu() is now correctly handled, forcing that CPU
out of dynticks-idle mode.

o Review comments have been applied (thank you all!!!).
For but one example, fixed the dynticks-ordering issue that
Manfred pointed out, saving me much debugging. ;-)

o Adjusted rcuclassic and rcupreempt to handle dynticks changes.

Attached is an updated patch to Classic RCU that applies a hierarchy,
greatly reducing the contention on the top-level lock for large machines.
This passes 10-hour concurrent rcutorture and online-offline testing on
128-CPU ppc64 without dynticks enabled, and exposes some timekeeping
bugs in presence of dynticks (exciting working on a system where
"sleep 1" hangs until interrupted...), which were fixed in the
2.6.27 kernel. It is getting more reliable than mainline by some
measures, so the next version will be against -tip for inclusion.
See also Manfred Spraul's recent patches (or his earlier work from
2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2).
We will converge onto a common patch in the fullness of time, but are
currently exploring different regions of the design space. That said,
I have already gratefully stolen quite a few of Manfred's ideas.

This patch provides CONFIG_RCU_FANOUT, which controls the bushiness
of the RCU hierarchy. Defaults to 32 on 32-bit machines and 64 on
64-bit machines. If CONFIG_NR_CPUS is less than CONFIG_RCU_FANOUT,
there is no hierarchy. By default, the RCU initialization code will
adjust CONFIG_RCU_FANOUT to balance the hierarchy, so strongly NUMA
architectures may choose to set CONFIG_RCU_FANOUT_EXACT to disable
this balancing, allowing the hierarchy to be exactly aligned to the
underlying hardware. Up to two levels of hierarchy are permitted
(in addition to the root node), allowing up to 16,384 CPUs on 32-bit
systems and up to 262,144 CPUs on 64-bit systems. I just know that I
am going to regret saying this, but this seems more than sufficient
for the foreseeable future. (Some architectures might wish to set
CONFIG_RCU_FANOUT=4, which would limit such architectures to 64 CPUs.
If this becomes a real problem, additional levels can be added, but I
doubt that it will make a significant difference on real hardware.)

In the common case, a given CPU will manipulate its private rcu_data
structure and the rcu_node structure that it shares with its immediate
neighbors. This can reduce both lock and memory contention by multiple
orders of magnitude, which should eliminate the need for the strange
manipulations that are reported to be required when running Linux on
very large systems.

Some shortcomings:

o More bugs will probably surface as a result of an ongoing
line-by-line code inspection.

Patches will be provided as required.

o There are probably hangs, rcutorture failures, &c. Seems
quite stable on a 128-CPU machine, but that is kind of small
compared to 4096 CPUs. However, seems to do better than
mainline.

Patches will be provided as required.

o The memory footprint of this version is several KB larger
than rcuclassic.

A separate UP-only rcutiny patch will be provided, which will
reduce the memory footprint significantly, even compared
to the old rcuclassic. One such patch passes light testing,
and has a memory footprint smaller even than rcuclassic.
Initial reaction from various embedded guys was "it is not
worth it", so am putting it aside.

Credits:

o Manfred Spraul for ideas, review comments, and bugs spotted,
as well as some good friendly competition. ;-)

o Josh Triplett, Ingo Molnar, Peter Zijlstra, Mathieu Desnoyers,
Lai Jiangshan, Andi Kleen, Andy Whitcroft, and Andrew Morton
for reviews and comments.

o Thomas Gleixner for much-needed help with some timer issues
(see patches below).

o Jon M. Tollefson, Tim Pepper, Andrew Theurer, Jose R. Santos,
Andy Whitcroft, Darrick Wong, Nishanth Aravamudan, Anton
Blanchard, Dave Kleikamp, and Nathan Lynch for keeping machines
alive despite my heavy abuse^Wtesting.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 6a60dd12 06-Nov-2008 Steven Rostedt <srostedt@redhat.com>

ftrace: split out hardirq ftrace code into own header

Impact: moving of function prototypes into own header file

ftrace.h is too big of a file for hardirq.h, and some archs will fail
to build because of the include dependencies not being met.

This patch pulls out the required prototypes for hardirq.h into a smaller
and safer ftrace_irq.h file.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 7e5e26a3 31-Oct-2008 Steven Rostedt <rostedt@goodmis.org>

ftrace: fix hardirq header for non ftrace archs

Impact: build fix for non-ftrace architectures

Not all archs implement ftrace, and therefore do not have an asm/ftrace.h.
This patch corrects the problem.

The ftrace_nmi_enter/exit now must be defined for all archs that implement
dynamic ftrace. Currently, only x86 does.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 17666f02 30-Oct-2008 Steven Rostedt <rostedt@goodmis.org>

ftrace: nmi safe code modification

Impact: fix crashes that can occur in NMI handlers, if their code is modified

Modifying code is something that needs special care. On SMP boxes,
if code that is being modified is also being executed on another CPU,
that CPU will have undefined results.

The dynamic ftrace uses kstop_machine to make the system act like a
uniprocessor system. But this does not address NMIs, that can still
run on other CPUs.

One approach to handle this is to make all code that are used by NMIs
not be traced. But NMIs can call notifiers that spread throughout the
kernel and this will be very hard to maintain, and the chance of missing
a function is very high.

The approach that this patch takes is to have the NMIs modify the code
if the modification is taking place. The way this works is that just
writing to code executing on another CPU is not harmful if what is
written is the same as what exists.

Two buffers are used: an IP buffer and a "code" buffer.

The steps that the patcher takes are:

1) Put in the instruction pointer into the IP buffer
and the new code into the "code" buffer.
2) Set a flag that says we are modifying code
3) Wait for any running NMIs to finish.
4) Write the code
5) clear the flag.
6) Wait for any running NMIs to finish.

If an NMI is executed, it will also write the pending code.
Multiple writes are OK, because what is being written is the same.
Then the patcher must wait for all running NMIs to finish before
going to the next line that must be patched.

This is basically the RCU approach to code modification.

Thanks to Ingo Molnar for suggesting the idea, and to Arjan van de Ven
for his guidence on what is safe and what is not.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 8e3e076c 10-May-2008 Linus Torvalds <torvalds@linux-foundation.org>

BKL: revert back to the old spinlock implementation

The generic semaphore rewrite had a huge performance regression on AIM7
(and potentially other BKL-heavy benchmarks) because the generic
semaphores had been rewritten to be simple to understand and fair. The
latter, in particular, turns a semaphore-based BKL implementation into a
mess of scheduling.

The attempt to fix the performance regression failed miserably (see the
previous commit 00b41ec2611dc98f87f30753ee00a53db648d662 'Revert
"semaphore: fix"'), and so for now the simple and sane approach is to
instead just go back to the old spinlock-based BKL implementation that
never had any issues like this.

This patch also has the advantage of being reported to fix the
regression completely according to Yanmin Zhang, unlike the semaphore
hack which still left a couple percentage point regression.

As a spinlock, the BKL obviously has the potential to be a latency
issue, but it's not really any different from any other spinlock in that
respect. We do want to get rid of the BKL asap, but that has been the
plan for several years.

These days, the biggest users are in the tty layer (open/release in
particular) and Alan holds out some hope:

"tty release is probably a few months away from getting cured - I'm
afraid it will almost certainly be the very last user of the BKL in
tty to get fixed as it depends on everything else being sanely locked."

so while we're not there yet, we do have a plan of action.

Tested-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8c703d35 28-Mar-2008 Jonathan Corbet <corbet@lwn.net>

in_atomic(): document why it is unsuitable for general use

Discourage people from inappropriately using in_atomic()

Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2232c2d8 29-Feb-2008 Steven Rostedt <rostedt@goodmis.org>

rcu: add support for dynamic ticks and preempt rcu

The PREEMPT-RCU can get stuck if a CPU goes idle and NO_HZ is set. The
idle CPU will not progress the RCU through its grace period and a
synchronize_rcu my get stuck. Without this patch I have a box that will
not boot when PREEMPT_RCU and NO_HZ are set. That same box boots fine
with this patch.

This patch comes from the -rt kernel where it has been tested for
several months.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 6478d880 25-Jan-2008 Ingo Molnar <mingo@elte.hu>

sched: remove the !PREEMPT_BKL code

remove the !PREEMPT_BKL code.

this removes 160 lines of legacy code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 4da1ce6d 09-Jul-2007 Ingo Molnar <mingo@elte.hu>

sched: add in_atomic_preempt_off()

add in_atomic_preempt_off() - debugging helper that will
simplify schedule().

Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 79bf2bb3 16-Feb-2007 Thomas Gleixner <tglx@linutronix.de>

[PATCH] tick-management: dyntick / highres functionality

With Ingo Molnar <mingo@elte.hu>

Add functions to provide dynamic ticks and high resolution timers. The code
which keeps track of jiffies and handles the long idle periods is shared
between tick based and high resolution timer based dynticks. The dyntick
functionality can be disabled on the kernel commandline. Provide also the
infrastructure to support high resolution timers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dde4b2b5 16-Feb-2007 Ingo Molnar <mingo@elte.hu>

[PATCH] uninline irq_enter()

Uninline irq_enter(). [dynticks adds more stuff to it]

No functional changes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 23d0b8b0 04-Oct-2006 Eric W. Biederman <ebiederm@xmission.com>

[PATCH] genirq: irq: generalize the check for HARDIRQ_BITS

This patch adds support for systems that cannot receive every interrupt on a
single cpu simultaneously, in the check to see if we have enough HARDIRQ_BITS.

MAX_HARDIRQS_PER_CPU becomes the count of the maximum number of hardare
generated interrupts per cpu.

On architectures that support per cpu interrupt delivery this can be a
significant space savings and scalability bonus.

This patch adds support for systems that cannot receive every interrupt on

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rajesh Shah <rajesh.shah@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: "Protasevich, Natalie" <Natalie.Protasevich@UNISYS.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# fbb9ce95 03-Jul-2006 Ingo Molnar <mingo@elte.hu>

[PATCH] lockdep: core

Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.

Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.

What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.

When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]

Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelyhood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemail-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.

To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persist during the lifetime of the kernel.

To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:

lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176

The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.

More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also found at:

http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt

[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# de30a2b3 03-Jul-2006 Ingo Molnar <mingo@elte.hu>

[PATCH] lockdep: irqtrace subsystem, core

Accurate hard-IRQ-flags and softirq-flags state tracing.

This allows us to attach extra functionality to IRQ flags on/off
events (such as trace-on/off).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 62c4f0a2 25-Apr-2006 David Woodhouse <dwmw2@infradead.org>

Don't include linux/config.h from anywhere else in include/

Signed-off-by: David Woodhouse <dwmw2@infradead.org>


# 1f1c12af 14-Jan-2006 Martin Schwidefsky <schwidefsky@de.ibm.com>

[PATCH] s390: cputime misaccounting

finish_arch_switch needs to update the user cpu time as well, not just the
system cpu time. Otherwise the partial user cpu time of a process that is
stored in the lowcore will be (mis-)accounted to the next process.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f037360f 13-Nov-2005 Al Viro <viro@parcelfarce.linux.theplanet.co.uk>

[PATCH] m68k: thread_info header cleanup

a) in smp_lock.h #include of sched.h and spinlock.h moved under #ifdef
CONFIG_LOCK_KERNEL.

b) interrupt.h now explicitly pulls sched.h (not via smp_lock.h from
hardirq.h as it used to)

c) in three more places we need changes to compensate for (a) - one place
in arch/sparc needs string.h now, hardirq.h needs forward declaration of
task_struct and preempt.h needs direct include of thread_info.h.

d) thread_info-related helpers in sched.h and thread_info.h put under
ifndef __HAVE_THREAD_FUNCTIONS. Obviously safe.

Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 67bc4eb0 12-Jul-2005 Randy Dunlap <rdunlap@infradead.org>

[PATCH] hardirq uses preempt

hardirq.h uses preempt_count() from preempt.h

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8f28e8fa 28-May-2005 Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>

[PATCH] irq code: Add coherence test for PREEMPT_ACTIVE

After porting this fixlet to UML:

http://linux.bkbits.net:8080/linux-2.5/cset@41791ab52lfMuF2i3V-eTIGRBbDYKQ

, I've also added a warning which should refuse compilation with insane values
for PREEMPT_ACTIVE... maybe we should simply move PREEMPT_ACTIVE out of
architectures using GENERIC_IRQS.

Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!