History log of /linux-master/arch/powerpc/kernel/smp.c
Revision Date Author Comments
# 5c4233cc 26-Feb-2024 Hari Bathini <hbathini@linux.ibm.com>

powerpc/kdump: Split KEXEC_CORE and CRASH_DUMP dependency

Remove CONFIG_CRASH_DUMP dependency on CONFIG_KEXEC. CONFIG_KEXEC_CORE
was used at places where CONFIG_CRASH_DUMP or CONFIG_CRASH_RESERVE was
appropriate. Replace with appropriate #ifdefs to support CONFIG_KEXEC
and !CONFIG_CRASH_DUMP configuration option. Also, make CONFIG_FA_DUMP
dependent on CONFIG_CRASH_DUMP to avoid unmet dependencies for FA_DUMP
with !CONFIG_KEXEC_CORE configuration option.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240226103010.589537-4-hbathini@linux.ibm.com


# 54de4427 10-Feb-2024 Alex Shi <alexs@kernel.org>

sched/topology: Rename SD_SHARE_PKG_RESOURCES to SD_SHARE_LLC

SD_SHARE_PKG_RESOURCES is a bit of a misnomer: its naming suggests that
it's sharing all 'package resources' - while in reality it's specifically
for sharing the LLC only.

Rename it to SD_SHARE_LLC to reduce confusion.

[ mingo: Rewrote the confusing changelog as well. ]

Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Link: https://lore.kernel.org/r/20240210113924.1130448-5-alexs@kernel.org


# c4697571 14-Dec-2023 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Dynamically build Powerpc topology

Currently there are four Powerpc specific sched topologies. These are
all statically defined. However not all these topologies are used by
all Powerpc systems.

To avoid unnecessary degenerations by the scheduler, masks and flags
are compared. However if the sched topologies are build dynamically then
the code is simpler and there are greater chances of avoiding
degenerations.

Note:
Even X86 builds its sched topologies dynamically and proposed changes
are very similar to the way X86 is building its topologies.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231214180720.310852-6-srikar@linux.vnet.ibm.com


# 0e93f1c7 14-Dec-2023 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Avoid asym packing within thread_group of a core

PowerVM Hypervisor will schedule at a core granularity. However each
core can have more than one thread_groups. For better utilization in
case of a shared processor, its preferable for the scheduler to pack to
the lowest core. However there is no benefit of moving a thread between
two thread groups of the same core.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231214180720.310852-5-srikar@linux.vnet.ibm.com


# fd535a85 14-Dec-2023 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Add __ro_after_init attribute

There are some variables that are only updated at boot time.
So add __ro_after_init attribute to such variables

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231214180720.310852-4-srikar@linux.vnet.ibm.com


# 0e1c1986 14-Dec-2023 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Disable MC domain for shared processor

Like L2-cache info, coregroup information which is used to determine MC
sched domains is only present on dedicated LPARs. i.e PowerVM doesn't
export coregroup information for shared processor LPARs. Hence disable
creating MC domains on shared LPAR Systems.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231214180720.310852-3-srikar@linux.vnet.ibm.com


# aa80c634 14-Dec-2023 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Enable Asym packing for cores on shared processor

If there are shared processor LPARs, underlying Hypervisor can have more
virtual cores to handle than actual physical cores.

Starting with Power 9, a big core (aka SMT8 core) has 2 nearly
independent thread groups. On a shared processors LPARs, it helps to
pack threads to lesser number of cores so that the overall system
performance and utilization improves. PowerVM schedules at a big core
level. Hence packing to fewer cores helps.

Since each thread-group is independent, running threads on both the
thread-groups of a SMT8 core, should have a minimal adverse impact in
non over provisioned scenarios. These changes in this patchset will not
affect in the over provisioned scenario. If there are more threads than
SMT domains, then asym_packing will not kick-in

For example: Lets says there are two 8-core Shared LPARs that are
actually sharing a 8 Core shared physical pool, each running 8 threads
each. Then Consolidating 8 threads to 4 cores on each LPAR would help
them to perform better. This is because each of the LPAR will get
100% time to run applications and there will no switching required by
the Hypervisor.

To achieve this, enable SD_ASYM_PACKING flag at CACHE, MC and DIE level
when the system is running in shared processor mode and has big cores.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231214180720.310852-2-srikar@linux.vnet.ibm.com


# f577cd57 12-Jul-2023 Peter Zijlstra <peterz@infradead.org>

sched/topology: Rename 'DIE' domain to 'PKG'

While reworking the x86 topology code Thomas tripped over creating a 'DIE' domain
for the package mask. :-)

Since these names are CONFIG_SCHED_DEBUG=y only, rename them to make the
name less ambiguous.

[ Shrikanth Hegde: rename on s390 as well. ]
[ Valentin Schneider: also rename it in the comments. ]
[ mingo: port to recent kernels & find all remaining occurances. ]

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230712141056.GI3100107@hirez.programming.kicks-ass.net


# 448e9f34 08-Sep-2023 Frederic Weisbecker <frederic@kernel.org>

rcu: Standardize explicit CPU-hotplug calls

rcu_report_dead() and rcutree_migrate_callbacks() have their headers in
rcupdate.h while those are pure rcutree calls, like the other CPU-hotplug
functions.

Also rcu_cpu_starting() and rcu_report_dead() have different naming
conventions while they mirror each other's effects.

Fix the headers and propose a naming that relates both functions and
aligns with the prefix of other rcutree CPU-hotplug functions.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>


# 73c58e7e 05-Jul-2023 Michael Ellerman <mpe@ellerman.id.au>

powerpc: Add HOTPLUG_SMT support

Add support for HOTPLUG_SMT, which enables the generic sysfs SMT support
files in /sys/devices/system/cpu/smt, as well as the "nosmt" boot
parameter.

Implement the recently added hooks to allow partial SMT states, allow
any number of threads per core.

Tie the config symbol to HOTPLUG_CPU, which enables it on the major
platforms that support SMT. If there are other platforms that want the
SMT support that can be tweaked in future.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
[ldufour: remove topology_smt_supported]
[ldufour: remove topology_smt_threads_supported]
[ldufour: select CONFIG_SMT_NUM_THREADS_DYNAMIC]
[ldufour: update kernel-parameters.txt]
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Link: https://msgid.link/20230705145143.40545-10-ldufour@linux.ibm.com


# c3c2e937 24-May-2023 Nicholas Piggin <npiggin@gmail.com>

powerpc: Account mm_cpumask and active_cpus in init_mm

init_mm mm_cpumask and context.active_cpus is not maintained at boot
and hotplug. This seems to be harmless because init_mm does not have a
userspace and so never gets user TLBs flushed, but it looks odd and it
prevents some sanity checks being added.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230524060821.148015-2-npiggin@gmail.com


# 8608f14b 09-May-2023 Rohan McLure <rmclure@linux.ibm.com>

powerpc: Annotate accesses to ipi message flags

IPI message flags are observed and consequently consumed in the
smp_ipi_demux_relaxed function, which handles these message sources
until it observes none more arriving. Mark the checked loop guard with
READ_ONCE, to signal to KCSAN that the read is known to be volatile, and
that non-determinism is expected. Mark write for message source in
smp_muxed_ipi_set_message().

Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230510033117.1395895-8-rmclure@linux.ibm.com


# 0f613bfa 05-Jun-2023 Mark Rutland <mark.rutland@arm.com>

locking/atomic: treewide: use raw_atomic*_<op>()

Now that we have raw_atomic*_<op>() definitions, there's no need to use
arch_atomic*_<op>() definitions outside of the low-level atomic
definitions.

Move treewide users of arch_atomic*_<op>() over to the equivalent
raw_atomic*_<op>().

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230605070124.3741859-19-mark.rutland@arm.com


# 514ca14e 17-Apr-2023 ndesaulniers@google.com <ndesaulniers@google.com>

start_kernel: Add __no_stack_protector function attribute

Back during the discussion of
commit a9a3ed1eff36 ("x86: Fix early boot crash on gcc-10, third try")
we discussed the need for a function attribute to control the omission
of stack protectors on a per-function basis; at the time Clang had
support for no_stack_protector but GCC did not. This was fixed in
gcc-11. Now that the function attribute is available, let's start using
it.

Callers of boot_init_stack_canary need to use this function attribute
unless they're compiled with -fno-stack-protector, otherwise the canary
stored in the stack slot of the caller will differ upon the call to
boot_init_stack_canary. This will lead to a call to __stack_chk_fail()
then panic.

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94722
Link: https://lore.kernel.org/all/20200316130414.GC12561@hirez.programming.kicks-ass.net/
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20230412-no_stackp-v2-1-116f9fe4bbe7@google.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>

Signed-off-by: ndesaulniers@google.com <ndesaulniers@google.com>


# aa464ba9 03-Feb-2023 Nicholas Piggin <npiggin@gmail.com>

lazy tlb: introduce lazy tlb mm refcount helper functions

Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting.
This makes the lazy tlb mm references more obvious, and allows the
refcounting scheme to be modified in later changes. There is no
functional change with this patch.

Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4c8c3c7f 07-Mar-2023 Valentin Schneider <vschneid@redhat.com>

treewide: Trace IPIs sent via smp_send_reschedule()

To be able to trace invocations of smp_send_reschedule(), rename the
arch-specific definitions of it to arch_smp_send_reschedule() and wrap it
into an smp_send_reschedule() that contains a tracepoint.

Changes to include the declaration of the tracepoint were driven by the
following coccinelle script:

@func_use@
@@
smp_send_reschedule(...);

@include@
@@
#include <trace/events/ipi.h>

@no_include depends on func_use && !include@
@@
#include <...>
+
+ #include <trace/events/ipi.h>

[csky bits]
[riscv bits]
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/r/20230307143558.294354-6-vschneid@redhat.com


# 071c44e4 14-Feb-2023 Josh Poimboeuf <jpoimboe@kernel.org>

sched/idle: Mark arch_cpu_idle_dead() __noreturn

Before commit 076cbf5d2163 ("x86/xen: don't let xen_pv_play_dead()
return"), in Xen, when a previously offlined CPU was brought back
online, it unexpectedly resumed execution where it left off in the
middle of the idle loop.

There were some hacks to make that work, but the behavior was surprising
as do_idle() doesn't expect an offlined CPU to return from the dead (in
arch_cpu_idle_dead()).

Now that Xen has been fixed, and the arch-specific implementations of
arch_cpu_idle_dead() also don't return, give it a __noreturn attribute.

This will cause the compiler to complain if an arch-specific
implementation might return. It also improves code generation for both
caller and callee.

Also fixes the following warning:

vmlinux.o: warning: objtool: do_idle+0x25f: unreachable instruction

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/60d527353da8c99d4cf13b6473131d46719ed16d.1676358308.git.jpoimboe@kernel.org
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>


# 90f1b431 27-Nov-2022 Nicholas Piggin <npiggin@gmail.com>

powerpc: allow minimum sized kernel stack frames

This affects only 64-bit ELFv2 kernels, and reduces the minimum
asm-created stack frame size from 112 to 32 byte on those kernels.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221127124942.1665522-16-npiggin@gmail.com


# b37ac189 26-Sep-2022 Nathan Lynch <nathanl@linux.ibm.com>

powerpc/smp: poll cpu_callin_map more aggressively in __cpu_up()

At boot time, it is not necessary to delay between polls of
cpu_callin_map when waiting for a kicked CPU to come up. Remove the
delay intervals, but preserve the overall deadline (five seconds).

At run time, the first poll result is usually negative and we incur a
sleeping wait. If we spin on the callin word for a short time first,
we can reduce __cpu_up() from dozens of milliseconds to under 1ms in
the common case on a P9 LPAR:

$ ppc64_cpu --smt=off
$ bpftrace -e 'kprobe:__cpu_up {
@start[tid] = nsecs;
}
kretprobe:__cpu_up /@start[tid]/ {
@us = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}' -c 'ppc64_cpu --smt=on'

Before:

@us:
[16K, 32K) 85 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K) 13 |@@@@@@@ |

After:

@us:
[128, 256) 95 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512) 3 |@ |

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926220250.157022-1-nathanl@linux.ibm.com


# 3e731858 19-Sep-2022 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc: Remove CONFIG_PPC_FSL_BOOK3E

CONFIG_PPC_FSL_BOOK3E is redundant with CONFIG_PPC_E500.

Remove it.

And rename five files accordingly.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[mpe: Rename include guards to match new file names]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/795cb93b88c9a0279289712e674f39e3b108a1b4.1663606876.git.christophe.leroy@csgroup.eu


# 787dbea1 21-Jul-2022 Ben Dooks <ben-linux@fluff.org>

profile: setup_profiling_timer() is moslty not implemented

The setup_profiling_timer() is mostly un-implemented by many
architectures. In many places it isn't guarded by CONFIG_PROFILE which is
needed for it to be used. Make it a weak symbol in kernel/profile.c and
remove the 'return -EINVAL' implementations from the kenrel.

There are a couple of architectures which do return 0 from the
setup_profiling_timer() function but they don't seem to do anything else
with it. To keep the /proc compatibility for now, leave these for a
future update or removal.

On ARM, this fixes the following sparse warning:
arch/arm/kernel/smp.c:793:5: warning: symbol 'setup_profiling_timer' was not declared. Should it be static?

Link: https://lkml.kernel.org/r/20220721195509.418205-1-ben-linux@fluff.org
Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c7255058 29-Jun-2022 Hari Bathini <hbathini@linux.ibm.com>

powerpc/crash: save cpu register data in crash_smp_send_stop()

During kdump, two set of NMI IPIs are sent to secondary CPUs, if
'crash_kexec_post_notifiers' option is set. The first set of NMI IPIs
to stop the CPUs and the other set to collect register data. Instead,
capture register data for secondary CPUs while stopping them itself.
Also, fallback to smp_send_stop() in case the function gets called
without kdump configured.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220630064942.192283-1-hbathini@linux.ibm.com


# 87c78b61 18-May-2022 Michael Ellerman <mpe@ellerman.id.au>

powerpc: Fix all occurences of "the the"

Rather than waiting for the bots to fix these one-by-one, fix all
occurences of "the the" throughout arch/powerpc.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220518142629.513007-1-mpe@ellerman.id.au


# 5352090a 18-May-2022 Daniel Axtens <dja@axtens.net>

powerpc/kasan: Don't instrument non-maskable or raw interrupts

Disable address sanitization for raw and non-maskable interrupt
handlers, because they can run in real mode, where we cannot access
the shadow memory. (Note that kasan_arch_is_ready() doesn't test for
real mode, since it is a static branch for speed, and in any case not
all the entry points to the generic KASAN code are protected by
kasan_arch_is_ready guards.)

The changes to interrupt_nmi_enter/exit_prepare() look larger than
they actually are. The changes are equivalent to adding
!IS_ENABLED(CONFIG_KASAN) to the conditions for calling nmi_enter() or
nmi_exit() in real mode. That is, the code is equivalent to using the
following condition for calling nmi_enter/exit:

if (((!IS_ENABLED(CONFIG_PPC_BOOK3S_64) ||
!firmware_has_feature(FW_FEATURE_LPAR) ||
radix_enabled()) &&
!IS_ENABLED(CONFIG_KASAN) ||
(mfmsr() & MSR_DR))

That unwieldy condition has been split into several statements with
comments, for easier reading.

The nmi_ipi_lock functions that call atomic functions (i.e.,
nmi_ipi_lock_start(), nmi_ipi_lock() and nmi_ipi_unlock()), besides
being marked noinstr, now call arch_atomic_* functions instead of
atomic_* functions because with KASAN enabled, the atomic_* functions
are wrappers which explicitly do address sanitization on their
arguments. Since we are trying to avoid address sanitization, we have
to use the lower-level arch_atomic_* versions.

In hv_nmi_check_nonrecoverable(), the regs_set_unrecoverable() call
has been open-coded so as to avoid having to either trust the inlining
or mark regs_set_unrecoverable() as noinstr.

[paulus@ozlabs.org: combined a few work-in-progress commits of
Daniel's and wrote the commit message.]

Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/YoTFGaKM8Pd46PIK@cleo


# 86c38fec 08-Mar-2022 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc: Remove asm/prom.h from all files that don't need it

Several files include asm/prom.h for no reason.

Clean it up.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[mpe: Drop change to prom_parse.c as reported by lkp@intel.com]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/7c9b8fda63dcf63e1b28f43e7ebdb95182cbc286.1646767214.git.christophe.leroy@csgroup.eu


# 1fd02f66 30-Apr-2022 Julia Lawall <Julia.Lawall@inria.fr>

powerpc: fix typos in comments

Various spelling mistakes in comments.
Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220430185654.5855-1-Julia.Lawall@inria.fr


# 76222808 04-Mar-2022 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc: Move C prototypes out of asm-prototypes.h

We originally added asm-prototypes.h in commit 42f5b4cacd78 ("powerpc:
Introduce asm-prototypes.h"). It's purpose was for prototypes of C
functions that are only called from asm, in order to fix sparse
warnings about missing prototypes.

A few months later Nick added a different use case in
commit 4efca4ed05cb ("kbuild: modversions for EXPORT_SYMBOL() for asm")
for C prototypes for exported asm functions. This is basically the
inverse of our original usage.

Since then we've added various prototypes to asm-prototypes.h for both
reasons, meaning we now need to unstitch it all.

Dispatch prototypes of C functions into relevant headers and keep
only the prototypes for functions defined in assembly.

For the time being, leave prom_init() there because moving it
into asm/prom.h or asm/setup.h conflicts with
drivers/gpu/drm/nouveau/nvkm/subdev/bios/shadowrom.o
This will be fixed later by untaggling asm/pci.h and asm/prom.h
or by renaming the function in shadowrom.c

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/62d46904eca74042097acf4cb12c175e3067f3d1.1646413435.git.christophe.leroy@csgroup.eu


# e15c703b 04-Mar-2022 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc/smp: Declare current_set static

current_set extern not needed anymore since
commit eafd825ed710 ("powerpc/64: Simplify __secondary_start
paca->kstack handling")

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a55eb65c9d7319f0af3c31e3f6ba36522f10003d.1646413435.git.christophe.leroy@csgroup.eu


# d276960d 16-Dec-2021 Nick Child <nick.child@ibm.com>

powerpc/kernel: Add __init attribute to eligible functions

Some functions defined in `arch/powerpc/kernel` (and one in `arch/powerpc/
kexec`) are deserving of an `__init` macro attribute. These functions are
only called by other initialization functions and therefore should inherit
the attribute.
Also, change function declarations in header files to include `__init`.

Signed-off-by: Nick Child <nick.child@ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211216220035.605465-2-nick.child@ibm.com


# 06e629c2 07-Dec-2021 Hari Bathini <hbathini@linux.ibm.com>

powerpc/fadump: Fix inaccurate CPU state info in vmcore generated with panic

In panic path, fadump is triggered via a panic notifier function.
Before calling panic notifier functions, smp_send_stop() gets called,
which stops all CPUs except the panic'ing CPU. Commit 8389b37dffdc
("powerpc: stop_this_cpu: remove the cpu from the online map.") and
again commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
started marking CPUs as offline while stopping them. So, if a kernel
has either of the above commits, vmcore captured with fadump via panic
path would not process register data for all CPUs except the panic'ing
CPU. Sample output of crash-utility with such vmcore:

# crash vmlinux vmcore
...
KERNEL: vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 1
DATE: Wed Nov 10 09:56:34 EST 2021
UPTIME: 00:00:42
LOAD AVERAGE: 2.27, 0.69, 0.24
TASKS: 183
NODENAME: XXXXXXXXX
RELEASE: 5.15.0+
VERSION: #974 SMP Wed Nov 10 04:18:19 CST 2021
MACHINE: ppc64le (2500 Mhz)
MEMORY: 8 GB
PANIC: "Kernel panic - not syncing: sysrq triggered crash"
PID: 3394
COMMAND: "bash"
TASK: c0000000150a5f80 [THREAD_INFO: c0000000150a5f80]
CPU: 1
STATE: TASK_RUNNING (PANIC)

crash> p -x __cpu_online_mask
__cpu_online_mask = $1 = {
bits = {0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
}
crash>
crash>
crash> p -x __cpu_active_mask
__cpu_active_mask = $2 = {
bits = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
}
crash>

While this has been the case since fadump was introduced, the issue
was not identified for two probable reasons:

- In general, the bulk of the vmcores analyzed were from crash
due to exception.

- The above did change since commit 8341f2f222d7 ("sysrq: Use
panic() to force a crash") started using panic() instead of
deferencing NULL pointer to force a kernel crash. But then
commit de6e5d38417e ("powerpc: smp_send_stop do not offline
stopped CPUs") stopped marking CPUs as offline till kernel
commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
reverted that change.

To ensure post processing register data of all other CPUs happens
as intended, let panic() function take the crash friendly path (read
crash_smp_send_stop()) with the help of crash_kexec_post_notifiers
option. Also, as register data for all CPUs is captured by f/w, skip
IPI callbacks here for fadump, to avoid any complications in finding
the right backtraces.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211207103719.91117-2-hbathini@linux.ibm.com


# 219572d2 07-Dec-2021 Hari Bathini <hbathini@linux.ibm.com>

powerpc: handle kdump appropriately with crash_kexec_post_notifiers option

Kdump can be triggered after panic_notifers since commit f06e5153f4ae2
("kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump
after panic_notifers") introduced crash_kexec_post_notifiers option.
But using this option would mean smp_send_stop(), that marks all other
CPUs as offline, gets called before kdump is triggered. As a result,
kdump routines fail to save other CPUs' registers. To fix this, kdump
friendly crash_smp_send_stop() function was introduced with kernel
commit 0ee59413c967 ("x86/panic: replace smp_send_stop() with kdump
friendly version in panic path"). Override this kdump friendly weak
function to handle crash_kexec_post_notifiers option appropriately
on powerpc.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
[Fixed signature of crash_stop_this_cpu() - reported by lkp@intel.com]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211207103719.91117-1-hbathini@linux.ibm.com


# a4ac0d24 24-Nov-2021 Michael Ellerman <mpe@ellerman.id.au>

powerpc/smp: Move setup_profiling_timer() under CONFIG_PROFILING

setup_profiling_timer() is only needed when CONFIG_PROFILING is enabled.

Fixes the following W=1 warning when CONFIG_PROFILING=n:
linux/arch/powerpc/kernel/smp.c:1638:5: error: no previous prototype for ‘setup_profiling_timer’

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211124093254.1054750-5-mpe@ellerman.id.au


# 41408b22 06-Oct-2021 Rob Herring <robh@kernel.org>

powerpc: Use of_get_cpu_hwid()

Replace open coded parsing of CPU nodes' 'reg' property with
of_get_cpu_hwid().

Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Rob Herring <robh@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211006164332.1981454-8-robh@kernel.org


# bcf9033e 14-Sep-2021 Ard Biesheuvel <ardb@kernel.org>

sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y

THREAD_INFO_IN_TASK moved the CPU field out of thread_info, but this
causes some issues on architectures that define raw_smp_processor_id()
in terms of this field, due to the fact that #include'ing linux/sched.h
to get at struct task_struct is problematic in terms of circular
dependencies.

Given that thread_info and task_struct are the same data structure
anyway when THREAD_INFO_IN_TASK=y, let's move it back so that having
access to the type definition of struct thread_info is sufficient to
reference the CPU number of the current task.

Note that this requires THREAD_INFO_IN_TASK's definition of the
task_thread_info() helper to be updated, as task_cpu() takes a
pointer-to-const, whereas task_thread_info() (which is used to generate
lvalues as well), needs a non-const pointer. So make it a macro instead.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>


# 787252a1 14-Oct-2021 Nathan Lynch <nathanl@linux.ibm.com>

powerpc/smp: do not decrement idle task preempt count in CPU offline

With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we
get:

BUG: scheduling while atomic: swapper/1/0/0x00000000
no locks held by swapper/1/0.
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100
Call Trace:
dump_stack_lvl+0xac/0x108
__schedule_bug+0xac/0xe0
__schedule+0xcf8/0x10d0
schedule_idle+0x3c/0x70
do_idle+0x2d8/0x4a0
cpu_startup_entry+0x38/0x40
start_secondary+0x2ec/0x3a0
start_secondary_prolog+0x10/0x14

This is because powerpc's arch_cpu_idle_dead() decrements the idle task's
preempt count, for reasons explained in commit a7c2bb8279d2 ("powerpc:
Re-enable preemption before cpu_die()"), specifically "start_secondary()
expects a preempt_count() of 0."

However, since commit 2c669ef6979c ("powerpc/preempt: Don't touch the idle
task's preempt_count during hotplug") and commit f1a0a376ca0c ("sched/core:
Initialize the idle task with preemption disabled"), that justification no
longer holds.

The idle task isn't supposed to re-enable preemption, so remove the
vestigial preempt_enable() from the CPU offline path.

Tested with pseries and powernv in qemu, and pseries on PowerVM.

Fixes: 2c669ef6979c ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211015173902.2278118-1-nathanl@linux.ibm.com


# 9a245d0e 26-Aug-2021 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Update cpu_cpu_map on CPU online/offline

cpu_cpu_map holds all the CPUs in the DIE. However in PowerPC, when
onlining/offlining of CPUs, this mask doesn't get updated. This mask
is however updated when CPUs are added/removed. So when both
operations like online/offline of CPUs and adding/removing of CPUs are
done simultaneously, then cpumaps end up broken.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898
build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag
udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag
bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
rfkill nf_tables nfnetlink pseries_rng xts vmx_crypto uio_pdrv_genirq
uio binfmt_misc ip_tables xfs libcrc32c dm_service_time sd_mod t10_pi sg
ibmvfc scsi_transport_fc ibmveth dm_multipath dm_mirror dm_region_hash
dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP: c0000000001caac8 LR: c0000000001caac4 CTR: 00000000007088ec
REGS: c00000005596f220 TRAP: 0700 Not tainted (5.13.0-rc6+)
MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 48828222 XER:
00000009
CFAR: c0000000001ea698 IRQMASK: 0
GPR00: c0000000001caac4 c00000005596f4c0 c000000001c4a400 0000000000000036
GPR04: 00000000fffdffff c00000005596f1d0 0000000000000027 c0000018cfd07f90
GPR08: 0000000000000023 0000000000000001 0000000000000027 c0000018fe68ffe8
GPR12: 0000000000008000 c00000001e9d1880 c00000013a047200 0000000000000800
GPR16: c000000001d3c7d0 0000000000000240 0000000000000048 c000000010aacd18
GPR20: 0000000000000001 c000000010aacc18 c00000013a047c00 c000000139ec2400
GPR24: 0000000000000280 c000000139ec2520 c000000136c1b400 c000000001c93060
GPR28: c00000013a047c20 c000000001d3c6c0 c000000001c978a0 000000000000000d
NIP [c0000000001caac8] build_sched_domains+0xd48/0x1720
LR [c0000000001caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c00000005596f4c0] [c0000000001caac4] build_sched_domains+0xd44/0x1720 (unreliable)
[c00000005596f670] [c0000000001cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c00000005596f710] [c0000000002804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c00000005596f810] [c000000000283e60] rebuild_sched_domains+0x40/0x70
[c00000005596f840] [c000000000284124] cpuset_hotplug_workfn+0x294/0xf10
[c00000005596fc60] [c000000000175040] process_one_work+0x290/0x590
[c00000005596fd00] [c0000000001753c8] worker_thread+0x88/0x620
[c00000005596fda0] [c000000000181704] kthread+0x194/0x1a0
[c00000005596fe10] [c00000000000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 60000000 2fa30800 409e0028 80fe0000 e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 60000000 <0fe00000> 39000000 38e00000 38c00000

Fix this by updating cpu_cpu_map aka cpumask_of_node() on every CPU
online/offline.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210826100521.412639-5-srikar@linux.vnet.ibm.com


# 5bf63497 26-Aug-2021 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Enable CACHE domain for shared processor

Currently CACHE domain is not enabled on shared processor mode PowerVM
LPARS. On PowerVM systems, 'ibm,thread-group' device-tree property 2
under cpu-device-node indicates which all CPUs share L2-cache. However
'ibm,thread-group' device-tree property 2 is a relatively new property.
In absence of 'ibm,thread-group' property 2, 'l2-cache' device property
under cpu-device-node could help system to identify CPUs sharing L2-cache.
However this property is not exposed by PhyP in shared processor mode
configurations.

In absence of properties that inform OS about which CPUs share L2-cache,
fallback on core boundary.

Here are some stats from Power9 shared LPAR with the changes.

$ lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 3
NUMA node(s): 2
Model: 2.2 (pvr 004e 0202)
Model name: POWER9 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 32K
L1i cache: 32K
NUMA node0 CPU(s): 16-23
NUMA node1 CPU(s): 0-15,24-31
Physical sockets: 2
Physical chips: 1
Physical cores/chip: 10

Before patch
$ grep -r . /sys/kernel/debug/sched/domains/cpu0/domain*/name
Before
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:DIE
/sys/kernel/debug/sched/domains/cpu0/domain2/name:NUMA

After
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:CACHE
/sys/kernel/debug/sched/domains/cpu0/domain2/name:DIE
/sys/kernel/debug/sched/domains/cpu0/domain3/name:NUMA

$ awk '/domain/{print $1, $2}' /proc/schedstat | sort -u | sed -e 's/00000000,//g'
Before
domain0 00000055
domain0 000000aa
domain0 00005500
domain0 0000aa00
domain0 00550000
domain0 00aa0000
domain0 55000000
domain0 aa000000
domain1 00ff0000
domain1 ff00ffff
domain2 ffffffff

After
domain0 00000055
domain0 000000aa
domain0 00005500
domain0 0000aa00
domain0 00550000
domain0 00aa0000
domain0 55000000
domain0 aa000000
domain1 000000ff
domain1 0000ff00
domain1 00ff0000
domain1 ff000000
domain2 ff00ffff
domain2 ffffffff
domain3 ffffffff

(Lower is better)
perf stat -a -r 5 -n perf bench sched pipe | tail -n 2
Before
153.798 +- 0.142 seconds time elapsed ( +- 0.09% )

After
111.545 +- 0.652 seconds time elapsed ( +- 0.58% )

which is an improvement of 27.47%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210826100401.412519-4-srikar@linux.vnet.ibm.com


# b8b92803 26-Aug-2021 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Update cpu_core_map on all PowerPc systems

lscpu() uses core_siblings to list the number of sockets in the
system. core_siblings is set using topology_core_cpumask.

While optimizing the powerpc bootup path, Commit 4ca234a9cbd7
("powerpc/smp: Stop updating cpu_core_mask"). it was found that
updating cpu_core_mask() ended up taking a lot of time. It was thought
that on Powerpc, cpu_core_mask() would always be same as
cpu_cpu_mask() i.e number of sockets will always be equal to number of
nodes. As an optimization, cpu_core_mask() was made a snapshot of
cpu_cpu_mask().

However that was found to be false with PowerPc KVM guests, where each
node could have more than one socket. So with Commit c47f892d7aa6
("powerpc/smp: Reintroduce cpu_core_mask"), cpu_core_mask was updated
based on chip_id but in an optimized way using some mask manipulations
and chip_id caching.

However on non-PowerNV and non-pseries KVM guests (i.e not
implementing cpu_to_chip_id(), continued to use a copy of
cpu_cpu_mask().

There are two issues that were noticed on such systems
1. lscpu would report one extra socket.
On a IBM,9009-42A (aka zz system) which has only 2 chips/ sockets/
nodes, lscpu would report
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 160
On-line CPU(s) list: 0-159
Thread(s) per core: 8
Core(s) per socket: 6
Socket(s): 3 <--------------
NUMA node(s): 2
Model: 2.2 (pvr 004e 0202)
Model name: POWER9 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-79
NUMA node1 CPU(s): 80-159

2. Currently cpu_cpu_mask is updated when a core is
added/removed. However its not updated when smt mode switching or on
CPUs are explicitly offlined. However all other percpu masks are
updated to ensure only active/online CPUs are in the masks.
This results in build_sched_domain traces since there will be CPUs in
cpu_cpu_mask() but those CPUs are not present in SMT / CACHE / MC /
NUMA domains. A loop of threads running smt mode switching and core
add/remove will soon show this trace.
Hence cpu_cpu_mask has to be update at smt mode switch.

This will have impact on cpu_core_mask(). cpu_core_mask() is a
snapshot of cpu_cpu_mask. Different CPUs within the same socket will
end up having different cpu_core_masks since they are snapshots at
different points of time. This means when lscpu will start reporting
many more sockets than the actual number of sockets/ nodes / chips.

Different ways to handle this problem:
A. Update the snapshot aka cpu_core_mask for all CPUs whenever
cpu_cpu_mask is updated. This would a non-optimal solution.
B. Instead of a cpumask_var_t, make cpu_core_map a cpumask pointer
pointing to cpu_cpu_mask. However percpu cpumask pointer is frowned
upon and we need a clean way to handle PowerPc KVM guest which is
not a snapshot.
C. Update cpu_core_masks all PowerPc systems like in PowerPc KVM
guests using mask manipulations. This approach is relatively simple
and unifies with the existing code.
D. On top of 3, we could also resurrect get_physical_package_id which
could return a nid for the said CPU. However this is not needed at this
time.

Option C is the preferred approach for now.

While this is somewhat a revert of Commit 4ca234a9cbd7 ("powerpc/smp:
Stop updating cpu_core_mask").

1. Plain revert has some conflicts
2. For chip_id == -1, the cpu_core_mask is made identical to
cpu_cpu_mask, unlike previously where cpu_core_mask was set to a core
if chip_id doesn't exist.

This goes by the principle that if chip_id is not exposed, then
sockets / chip / node share the same set of CPUs.

With the fix, lscpu o/p would be
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 160
On-line CPU(s) list: 0-159
Thread(s) per core: 8
Core(s) per socket: 6
Socket(s): 2 <--------------
NUMA node(s): 2
Model: 2.2 (pvr 004e 0202)
Model name: POWER9 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-79
NUMA node1 CPU(s): 80-159

Fixes: 4ca234a9cbd7 ("powerpc/smp: Stop updating cpu_core_mask")
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210826100401.412519-3-srikar@linux.vnet.ibm.com


# 8efd249b 26-Aug-2021 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Fix a crash while booting kvm guest with nr_cpus=2

Aneesh reported a crash with a fairly recent upstream kernel when
booting kernel whose commandline was appended with nr_cpus=2

1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c000000008a67bd0]
pc: c00000000002557c: cpu_to_chip_id+0x3c/0x100
lr: c000000000058380: start_secondary+0x460/0xb00
sp: c000000008a67e70
msr: 8000000000001033
dar: 10
dsisr: 80000
current = 0xc00000000891bb00
paca = 0xc0000018ff981f80 irqmask: 0x03 irq_happened: 0x01
pid = 0, comm = swapper/1
Linux version 5.13.0-rc3-15704-ga050a6d2b7e8 (kvaneesh@ltc-boston8) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #433 SMP Tue May 25 02:38:49 CDT 2021
1:mon> t
[link register ] c000000000058380 start_secondary+0x460/0xb00
[c000000008a67e70] c000000008a67eb0 (unreliable)
[c000000008a67eb0] c0000000000589d4 start_secondary+0xab4/0xb00
[c000000008a67f90] c00000000000c654 start_secondary_prolog+0x10/0x14

Current code assumes that num_possible_cpus() is always greater than
threads_per_core. However this may not be true when using nr_cpus=2 or
similar options. Handle the case where num_possible_cpus() is not an
exact multiple of threads_per_core.

Fixes: c1e53367dab1 ("powerpc/smp: Cache CPU to chip lookup")
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Debugged-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210826100401.412519-2-srikar@linux.vnet.ibm.com


# e9ef81e1 28-Jul-2021 Parth Shah <parth@linux.ibm.com>

powerpc/smp: Use existing L2 cache_map cpumask to find L3 cache siblings

On POWER10 systems, the "ibm,thread-groups" property "2" indicates the cpus
in thread-group share both L2 and L3 caches. Hence, use cache_property = 2
itself to find both the L2 and L3 cache siblings.
Hence, create a new thread_group_l3_cache_map to keep list of L3 siblings,
but fill the mask using same property "2" array.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210728175607.591679-4-parth@linux.ibm.com


# a4bec516 28-Jul-2021 Gautham R. Shenoy <ego@linux.vnet.ibm.com>

powerpc/cacheinfo: Lookup cache by dt node and thread-group id

Currently the cacheinfo code on powerpc indexes the "cache" objects
(modelling the L1/L2/L3 caches) where the key is device-tree node
corresponding to that cache. On some of the POWER server platforms
thread-groups within the core share different sets of caches (Eg: On
SMT8 POWER9 systems, threads 0,2,4,6 of a core share L1 cache and
threads 1,3,5,7 of the same core share another L1 cache). On such
platforms, there is a single device-tree node corresponding to that
cache and the cache-configuration within the threads of the core is
indicated via "ibm,thread-groups" device-tree property.

Since the current code is not aware of the "ibm,thread-groups"
property, on the aforementoined systems, cacheinfo code still treats
all the threads in the core to be sharing the cache because of the
single device-tree node (In the earlier example, the cacheinfo code
would says CPUs 0-7 share L1 cache).

In this patch, we make the powerpc cacheinfo code aware of the
"ibm,thread-groups" property. We indexe the "cache" objects by the
key-pair (device-tree node, thread-group id). For any CPUX, for a
given level of cache, the thread-group id is defined to be the first
CPU in the "ibm,thread-groups" cache-group containing CPUX. For levels
of cache which are not represented in "ibm,thread-groups" property,
the thread-group id is -1.

[parth: Remove "static" keyword for the definition of "thread_group_l1_cache_map"
and "thread_group_l2_cache_map" to get rid of the compile error.]

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Parth Shah <parth@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210728175607.591679-2-parth@linux.ibm.com


# bab26238 22-Jun-2021 Nicholas Piggin <npiggin@gmail.com>

powerpc: Offline CPU in stop_this_cpu()

printk_safe_flush_on_panic() has special lock breaking code for the case
where we panic()ed with the console lock held. It relies on panic IPI
causing other CPUs to mark themselves offline.

Do as most other architectures do.

This effectively reverts commit de6e5d38417e ("powerpc: smp_send_stop do
not offline stopped CPUs"), unfortunately it may result in some false
positive warnings, but the alternative is more situations where we can
crash without getting messages out.

Fixes: de6e5d38417e ("powerpc: smp_send_stop do not offline stopped CPUs")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210623041245.865134-1-npiggin@gmail.com


# 86f46f34 03-Jun-2021 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc/32s: Initialise KUAP and KUEP in C

In order to selectively activate KUAP and KUEP in a following patch,
perform KUAP and KUEP initialisation in C.

Unlike PPC64, PPC32 doesn't have an early_setup_secondary(),
so do it in start_secondary().

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/87be72023448dd4e476744ed279b8c04b8d08a1c.1622708530.git.christophe.leroy@csgroup.eu


# a9ee6cf5 28-Jun-2021 Mike Rapoport <rppt@kernel.org>

mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA

After removal of DISCINTIGMEM the NEED_MULTIPLE_NODES and NUMA
configuration options are equivalent.

Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.

Done with

$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
$(git grep -wl NEED_MULTIPLE_NODES)

with manual tweaks afterwards.

[rppt@linux.ibm.com: fix arm boot crash]
Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com

Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f1a0a376 12-May-2021 Valentin Schneider <valentin.schneider@arm.com>

sched/core: Initialize the idle task with preemption disabled

As pointed out by commit

de9b8f5dcbd9 ("sched: Fix crash trying to dequeue/enqueue the idle thread")

init_idle() can and will be invoked more than once on the same idle
task. At boot time, it is invoked for the boot CPU thread by
sched_init(). Then smp_init() creates the threads for all the secondary
CPUs and invokes init_idle() on them.

As the hotplug machinery brings the secondaries to life, it will issue
calls to idle_thread_get(), which itself invokes init_idle() yet again.
In this case it's invoked twice more per secondary: at _cpu_up(), and at
bringup_cpu().

Given smp_init() already initializes the idle tasks for all *possible*
CPUs, no further initialization should be required. Now, removing
init_idle() from idle_thread_get() exposes some interesting expectations
with regards to the idle task's preempt_count: the secondary startup always
issues a preempt_disable(), requiring some reset of the preempt count to 0
between hot-unplug and hotplug, which is currently served by
idle_thread_get() -> idle_init().

Given the idle task is supposed to have preemption disabled once and never
see it re-enabled, it seems that what we actually want is to initialize its
preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove
init_idle() from idle_thread_get().

Secondary startups were patched via coccinelle:

@begone@
@@

-preempt_disable();
...
cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com


# 6980d13f 01-Apr-2021 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Set numa node before updating mask

Geethika reported a trace when doing a dlpar CPU add.

------------[ cut here ]------------
WARNING: CPU: 152 PID: 1134 at kernel/sched/topology.c:2057
CPU: 152 PID: 1134 Comm: kworker/152:1 Not tainted 5.12.0-rc5-master #5
Workqueue: events cpuset_hotplug_workfn
NIP: c0000000001cfc14 LR: c0000000001cfc10 CTR: c0000000007e3420
REGS: c0000034a08eb260 TRAP: 0700 Not tainted (5.12.0-rc5-master+)
MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 28828422 XER: 00000020
CFAR: c0000000001fd888 IRQMASK: 0 #012GPR00: c0000000001cfc10
c0000034a08eb500 c000000001f35400 0000000000000027 #012GPR04:
c0000035abaa8010 c0000035abb30a00 0000000000000027 c0000035abaa8018
#012GPR08: 0000000000000023 c0000035abaaef48 00000035aa540000
c0000035a49dffe8 #012GPR12: 0000000028828424 c0000035bf1a1c80
0000000000000497 0000000000000004 #012GPR16: c00000000347a258
0000000000000140 c00000000203d468 c000000001a1a490 #012GPR20:
c000000001f9c160 c0000034adf70920 c0000034aec9fd20 0000000100087bd3
#012GPR24: 0000000100087bd3 c0000035b3de09f8 0000000000000030
c0000035b3de09f8 #012GPR28: 0000000000000028 c00000000347a280
c0000034aefe0b00 c0000000010a2a68
NIP [c0000000001cfc14] build_sched_domains+0x6a4/0x1500
LR [c0000000001cfc10] build_sched_domains+0x6a0/0x1500
Call Trace:
[c0000034a08eb500] [c0000000001cfc10] build_sched_domains+0x6a0/0x1500 (unreliable)
[c0000034a08eb640] [c0000000001d1e6c] partition_sched_domains_locked+0x3ec/0x530
[c0000034a08eb6e0] [c0000000002936d4] rebuild_sched_domains_locked+0x524/0xbf0
[c0000034a08eb7e0] [c000000000296bb0] rebuild_sched_domains+0x40/0x70
[c0000034a08eb810] [c000000000296e74] cpuset_hotplug_workfn+0x294/0xe20
[c0000034a08ebc30] [c000000000178dd0] process_one_work+0x300/0x670
[c0000034a08ebd10] [c0000000001791b8] worker_thread+0x78/0x520
[c0000034a08ebda0] [c000000000185090] kthread+0x1a0/0x1b0
[c0000034a08ebe10] [c00000000000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
7d2903a6 4e800421 e8410018 7f67db78 7fe6fb78 7f45d378 7f84e378 7c681b78
3c62ff1a 3863c6f8 4802dc35 60000000 <0fe00000> 3920fff4 f9210070 e86100a0
---[ end trace 532d9066d3d4d7ec ]---

Some of the per-CPU masks use cpu_cpu_mask as a filter to limit the search
for related CPUs. On a dlpar add of a CPU, update cpu_cpu_mask before
updating the per-CPU masks. This will ensure the cpu_cpu_mask is updated
correctly before its used in setting the masks. Setting the numa_node will
ensure that when cpu_cpu_mask() gets called, the correct node number is
used. This code movement helped fix the above call trace.

Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210401154200.150077-1-srikar@linux.vnet.ibm.com


# c1e53367 15-Apr-2021 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Cache CPU to chip lookup

On systems with large CPUs per node, even with the filtered matching of
related CPUs, there can be large number of calls to cpu_to_chip_id for
the same CPU. For example with 4096 vCPU, 1 node QEMU configuration,
with 4 threads per core, system could be see upto 1024 calls to
cpu_to_chip_id() for the same CPU. On a given system, cpu_to_chip_id()
for a given CPU would always return the same. Hence cache the result in
a lookup table for use in subsequent calls.

Since all CPUs sharing the same core will belong to the same chip, the
lookup_table has an entry for one CPU per core. chip_id_lookup_table is
not being freed and would be used on subsequent CPU online post CPU
offline.

Reported-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210415120934.232271-4-srikar@linux.vnet.ibm.com


# c47f892d 15-Apr-2021 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Reintroduce cpu_core_mask

Daniel reported that with Commit 4ca234a9cbd7 ("powerpc/smp: Stop
updating cpu_core_mask") QEMU was unable to set single NUMA node SMP
topologies such as:
-smp 8,maxcpus=8,cores=2,threads=2,sockets=2
i.e he expected 2 sockets in one NUMA node.

The above commit helped to reduce boot time on Large Systems for
example 4096 vCPU single socket QEMU instance. PAPR is silent on
having more than one socket within a NUMA node.

cpu_core_mask and cpu_cpu_mask for any CPU would be same unless the
number of sockets is different from the number of NUMA nodes.

One option is to reintroduce cpu_core_mask but use a slightly
different method to arrive at the cpu_core_mask. Previously each CPU's
chip-id would be compared with all other CPU's chip-id to verify if
both the CPUs were related at the chip level. Now if a CPU 'A' is
found related / (unrelated) to another CPU 'B', all the thread
siblings of 'A' and thread siblings of 'B' are automatically marked as
related / (unrelated).

Also if a platform doesn't support ibm,chip-id property, i.e its
cpu_to_chip_id returns -1, cpu_core_map holds a copy of
cpu_cpu_mask().

Fixes: 4ca234a9cbd7 ("powerpc/smp: Stop updating cpu_core_mask")
Reported-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210415120934.232271-2-srikar@linux.vnet.ibm.com


# 078277ac 07-Apr-2021 Yu Kuai <yukuai3@huawei.com>

powerpc/smp: Make some symbols static

The sparse tool complains as follows:

arch/powerpc/kernel/smp.c:86:1: warning:
symbol '__pcpu_scope_cpu_coregroup_map' was not declared. Should it be static?
arch/powerpc/kernel/smp.c:125:1: warning:
symbol '__pcpu_scope_thread_group_l1_cache_map' was not declared. Should it be static?
arch/powerpc/kernel/smp.c:132:1: warning:
symbol '__pcpu_scope_thread_group_l2_cache_map' was not declared. Should it be static?

These symbols are not used outside of smp.c, so this
commit marks them static.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210407125903.4139663-1-yukuai3@huawei.com


# 157c9f40 04-Jan-2021 Cédric Le Goater <clg@kaod.org>

powerpc/smp: Make debugger_ipi_callback() static

debugger_ipi_callback() is a local routine used as a NMI IPI handler and
does not need to be external.

It fixes this W=1 compile error :

../arch/powerpc/kernel/smp.c:579:6: error: no previous prototype for ‘debugger_ipi_callback’ [-Werror=missing-prototypes]
579 | void debugger_ipi_callback(struct pt_regs *regs)
| ^~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210104143206.695198-10-clg@kaod.org


# cd7aa5d2 04-Jan-2021 Cédric Le Goater <clg@kaod.org>

powerpc/smp: Include tick_broadcast() prototype

It fixes this W=1 compile error :

../arch/powerpc/kernel/smp.c:569:6: error: no previous prototype for ‘tick_broadcast’ [-Werror=missing-prototypes]
569 | void tick_broadcast(const struct cpumask *mask)
| ^~~~~~~~~~~~~~

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210104143206.695198-9-clg@kaod.org


# 9014eab6 21-Dec-2020 Cédric Le Goater <clg@kaod.org>

powerpc/smp: Add __init to init_big_cores()

It fixes this link warning:

WARNING: modpost: vmlinux.o(.text.unlikely+0x2d98): Section mismatch in reference from the function init_big_cores.isra.0() to the function .init.text:init_thread_group_cache_map()
The function init_big_cores.isra.0() references
the function __init init_thread_group_cache_map().
This is often because init_big_cores.isra.0 lacks a __init
annotation or the annotation of init_thread_group_cache_map is wrong.

Fixes: 425752c63b6f ("powerpc: Detect the presence of big-cores via "ibm, thread-groups"")
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201221074154.403779-1-clg@kaod.org


# 9538abee 10-Dec-2020 Gautham R. Shenoy <ego@linux.vnet.ibm.com>

powerpc/smp: Add support detecting thread-groups sharing L2 cache

On POWER systems, groups of threads within a core sharing the L2-cache
can be indicated by the "ibm,thread-groups" property array with the
identifier "2".

This patch adds support for detecting this, and when present, populate
the populating the cpu_l2_cache_mask of every CPU to the core-siblings
which share L2 with the CPU as specified in the by the
"ibm,thread-groups" property array.

On a platform with the following "ibm,thread-group" configuration
00000001 00000002 00000004 00000000
00000002 00000004 00000006 00000001
00000003 00000005 00000007 00000002
00000002 00000004 00000000 00000002
00000004 00000006 00000001 00000003
00000005 00000007

Without this patch, the sched-domain hierarchy for CPUs 0,1 would be
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

The CACHE domain at 0-7 is incorrect since the ibm,thread-groups
sub-array
[00000002 00000002 00000004
00000000 00000002 00000004 00000006
00000001 00000003 00000005 00000007]
indicates that L2 (Property "2") is shared only between the threads of a single
group. There are "2" groups of threads where each group contains "4"
threads each. The groups being {0,2,4,6} and {1,3,5,7}.

With this patch, the sched-domain hierarchy for CPUs 0,1 would be
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
resp.) gets degenerated into the SMT domain. Furthermore, the
last-level-cache domain gets correctly set to the SMT sched-domain.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1607596739-32439-5-git-send-email-ego@linux.vnet.ibm.com


# fbd2b672 10-Dec-2020 Gautham R. Shenoy <ego@linux.vnet.ibm.com>

powerpc/smp: Rename init_thread_group_l1_cache_map() to make it generic

init_thread_group_l1_cache_map() initializes the per-cpu cpumask
thread_group_l1_cache_map with the core-siblings which share L1 cache
with the CPU. Make this function generic to the cache-property (L1 or
L2) and update a suitable mask. This is a preparatory patch for the
next patch where we will introduce discovery of thread-groups that
share L2-cache.

No functional change.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1607596739-32439-4-git-send-email-ego@linux.vnet.ibm.com


# 1fdc1d66 10-Dec-2020 Gautham R. Shenoy <ego@linux.vnet.ibm.com>

powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map

On platforms which have the "ibm,thread-groups" property, the per-cpu
variable cpu_l1_cache_map keeps a track of which group of threads
within the same core share the L1 cache, Instruction and Data flow.

This patch renames the variable to "thread_group_l1_cache_map" to make
it consistent with a subsequent patch which will introduce
thread_group_l2_cache_map.

This patch introduces no functional change.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1607596739-32439-3-git-send-email-ego@linux.vnet.ibm.com


# 790a1662 10-Dec-2020 Gautham R. Shenoy <ego@linux.vnet.ibm.com>

powerpc/smp: Parse ibm,thread-groups with multiple properties

The "ibm,thread-groups" device-tree property is an array that is used
to indicate if groups of threads within a core share certain
properties. It provides details of which property is being shared by
which groups of threads. This array can encode information about
multiple properties being shared by different thread-groups within the
core.

Example: Suppose,
"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed up into two consecutive arrays:

a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

where in,

a) provides information of Property "1" being shared by "2" groups,
each with "4" threads each. The "ibm,ppc-interrupt-server#s" of the
first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of
the second group is {9,11,13,15}. Property "1" is indicative of
the thread in the group sharing L1 cache, translation cache and
Instruction Data flow.

b) provides information of Property "2" being shared by "2" groups,
each group with "4" threads. The "ibm,ppc-interrupt-server#s" of
the first group is {8,10,12,14} and the
"ibm,ppc-interrupt-server#s" of the second group is
{9,11,13,15}. Property "2" indicates that the threads in each group
share the L2-cache.

The existing code assumes that the "ibm,thread-groups" encodes
information about only one property. Hence even on platforms which
encode information about multiple properties being shared by the
corresponding groups of threads, the current code will only pick the
first one. (In the above example, it will only consider
[1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]).

This patch extends the parsing support on platforms which encode
information about multiple properties being shared by the
corresponding groups of threads.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1607596739-32439-2-git-send-email-ego@linux.vnet.ibm.com


# 99f070b6 28-Oct-2020 Qian Cai <cai@redhat.com>

powerpc/smp: Call rcu_cpu_starting() earlier

The call to rcu_cpu_starting() in start_secondary() is not early
enough in the CPU-hotplug onlining process, which results in lockdep
splats as follows (with CONFIG_PROVE_RCU_LIST=y):

WARNING: suspicious RCU usage
-----------------------------
kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!

other info that might help us debug this:

RCU used illegally from offline CPU!
rcu_scheduler_active = 1, debug_locks = 1
no locks held by swapper/1/0.

Call Trace:
dump_stack+0xec/0x144 (unreliable)
lockdep_rcu_suspicious+0x128/0x14c
__lock_acquire+0x1060/0x1c60
lock_acquire+0x140/0x5f0
_raw_spin_lock_irqsave+0x64/0xb0
clockevents_register_device+0x74/0x270
register_decrementer_clockevent+0x94/0x110
start_secondary+0x134/0x800
start_secondary_prolog+0x10/0x14

This is avoided by adding a call to rcu_cpu_starting() near the
beginning of the start_secondary() function. Note that the
raw_smp_processor_id() is required in order to avoid calling into
lockdep before RCU has declared the CPU to be watched for readers.

It's safe to call rcu_cpu_starting() in the arch code as well as later
in generic code, as explained by Paul:

It uses a per-CPU variable so that RCU pays attention only to the
first call to rcu_cpu_starting() if there is more than one of them.
This is even intentional, due to there being a generic
arch-independent call to rcu_cpu_starting() in
notify_cpu_starting().

So multiple calls to rcu_cpu_starting() are fine by design.

Fixes: 4d004099a668 ("lockdep: Fix lockdep recursion")
Signed-off-by: Qian Cai <cai@redhat.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
[mpe: Add Fixes tag, reword slightly & expand change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201028182334.13466-1-cai@redhat.com


# 84dbf66c 18-Oct-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Use GFP_ATOMIC while allocating tmp mask

Qian Cai reported a regression where CPU Hotplug fails with the latest
powerpc/next

BUG: sleeping function called from invalid context at mm/slab.h:494
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/88
no locks held by swapper/88/0.
irq event stamp: 18074448
hardirqs last enabled at (18074447): [<c0000000001a2a7c>] tick_nohz_idle_enter+0x9c/0x110
hardirqs last disabled at (18074448): [<c000000000106798>] do_idle+0x138/0x3b0
do_idle at kernel/sched/idle.c:253 (discriminator 1)
softirqs last enabled at (18074440): [<c0000000000bbec4>] irq_enter_rcu+0x94/0xa0
softirqs last disabled at (18074439): [<c0000000000bbea0>] irq_enter_rcu+0x70/0xa0
CPU: 88 PID: 0 Comm: swapper/88 Tainted: G W 5.9.0-rc8-next-20201007 #1
Call Trace:
[c00020000a4bfcf0] [c000000000649e98] dump_stack+0xec/0x144 (unreliable)
[c00020000a4bfd30] [c0000000000f6c34] ___might_sleep+0x2f4/0x310
[c00020000a4bfdb0] [c000000000354f94] slab_pre_alloc_hook.constprop.82+0x124/0x190
[c00020000a4bfe00] [c00000000035e9e8] __kmalloc_node+0x88/0x3a0
slab_alloc_node at mm/slub.c:2817
(inlined by) __kmalloc_node at mm/slub.c:4013
[c00020000a4bfe80] [c0000000006494d8] alloc_cpumask_var_node+0x38/0x80
kmalloc_node at include/linux/slab.h:577
(inlined by) alloc_cpumask_var_node at lib/cpumask.c:116
[c00020000a4bfef0] [c00000000003eedc] start_secondary+0x27c/0x800
update_mask_by_l2 at arch/powerpc/kernel/smp.c:1267
(inlined by) add_cpu_to_masks at arch/powerpc/kernel/smp.c:1387
(inlined by) start_secondary at arch/powerpc/kernel/smp.c:1420
[c00020000a4bff90] [c00000000000c468] start_secondary_resume+0x10/0x14

Allocating a temporary mask while performing a CPU Hotplug operation
with CONFIG_CPUMASK_OFFSTACK enabled, leads to calling a sleepable
function from a atomic context. Fix this by allocating the temporary
mask with GFP_ATOMIC flag. Also instead of having to allocate twice,
allocate the mask in the caller so that we only have to allocate once.
If the allocation fails, assume the mask to be same as sibling mask, which
will make the scheduler to drop this domain for this CPU.

Fixes: 70a94089d7f7 ("powerpc/smp: Optimize update_coregroup_mask")
Fixes: 3ab33d6dc3e9 ("powerpc/smp: Optimize update_mask_by_l2")
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201019042716.106234-3-srikar@linux.vnet.ibm.com


# 966730a6 18-Oct-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Remove unnecessary variable

Commit 3ab33d6dc3e9 ("powerpc/smp: Optimize update_mask_by_l2")
introduced submask_fn in update_mask_by_l2 to track the right submask.
However commit f6606cfdfbcd ("powerpc/smp: Dont assume l2-cache to be
superset of sibling") introduced sibling_mask in update_mask_by_l2 to
track the same submask. Remove sibling_mask in favour of submask_fn.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201019042716.106234-2-srikar@linux.vnet.ibm.com


# 70a94089 21-Sep-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Optimize update_coregroup_mask

All threads of a SMT4/SMT8 core can either be part of CPU's coregroup
mask or outside the coregroup. Use this relation to reduce the
number of iterations needed to find all the CPUs that share the same
coregroup

Use a temporary mask to iterate through the CPUs that may share
coregroup mask. Also instead of setting one CPU at a time into
cpu_coregroup_mask, copy the SMT4/SMT8/submask at one shot.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200921095653.9701-12-srikar@linux.vnet.ibm.com


# b8a97cb4 21-Sep-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Move coregroup mask updation to a new function

Move the logic for updating the coregroup mask of a CPU to its own
function. This will help in reworking the updation of coregroup mask in
subsequent patch.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200921095653.9701-11-srikar@linux.vnet.ibm.com


# 3ab33d6d 21-Sep-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Optimize update_mask_by_l2

All threads of a SMT4 core can either be part of this CPU's l2-cache
mask or not related to this CPU l2-cache mask. Use this relation to
reduce the number of iterations needed to find all the CPUs that share
the same l2-cache.

Use a temporary mask to iterate through the CPUs that may share l2_cache
mask. Also instead of setting one CPU at a time into cpu_l2_cache_mask,
copy the SMT4/sub mask at one shot.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200921095653.9701-10-srikar@linux.vnet.ibm.com


# 375370a1 21-Sep-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Check for duplicate topologies and consolidate

CACHE and COREGROUP domains are now part of default topology. However on
systems that don't support CACHE or COREGROUP, these domains will
eventually be degenerated. The degeneration happens per CPU. Do note the
current fixup_topology() logic ensures that mask of a domain that is not
supported on the current platform is set to the previous domain.

Instead of waiting for the scheduler to degenerated try to consolidate
based on their masks and sd_flags. This is done just before setting
the scheduler topology.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200921095653.9701-9-srikar@linux.vnet.ibm.com


# 661e3d42 21-Sep-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Depend on cpu_l1_cache_map when adding CPUs

Currently on hotplug/hotunplug, CPU iterates through all the CPUs in
its core to find threads in its thread group. However this info is
already captured in cpu_l1_cache_map. Hence reduce iterations and
cleanup add_cpu_to_smallcore_masks function.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200921095653.9701-8-srikar@linux.vnet.ibm.com


# 1f3a4181 21-Sep-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Stop passing mask to update_mask_by_l2

update_mask_by_l2 is called only once. But it passes cpu_l2_cache_mask
as parameter. Instead of passing cpu_l2_cache_mask, use it directly in
update_mask_by_l2.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200921095653.9701-7-srikar@linux.vnet.ibm.com


# 53516d4a 21-Sep-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Limit CPUs traversed to within a node.

All the arch specific topology cpumasks are within a node/DIE.
However when setting these per CPU cpumasks, system traverses through
all the online CPUs. This is redundant.

Reduce the traversal to only CPUs that are online in the node to which
the CPU belongs to.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200921095653.9701-6-srikar@linux.vnet.ibm.com


# 70edd4a7 21-Sep-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Optimize remove_cpu_from_masks

While offlining a CPU, system currently iterate through all the CPUs in
the DIE to clear sibling, l2_cache and smallcore maps. However if there
are more cores in a DIE, system can end up spending more time iterating
through CPUs which are completely unrelated.

Optimize this by only iterating through smaller but relevant cpumap.
If shared_cache is set, cpu_l2_cache_map should be relevant else
cpu_sibling_map would be relevant.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200921095653.9701-5-srikar@linux.vnet.ibm.com


# e29e9ed6 21-Sep-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Remove get_physical_package_id

Now that cpu_core_mask has been removed and topology_core_cpumask has
been updated to use cpu_cpu_mask, we no more need
get_physical_package_id.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200921095653.9701-4-srikar@linux.vnet.ibm.com


# 4ca234a9 21-Sep-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Stop updating cpu_core_mask

Anton Blanchard reported that his 4096 vcpu KVM guest took around 30
minutes to boot. He also analyzed it to the time taken to iterate while
setting the cpu_core_mask.

Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU
would be equal on Power. However updating cpu_core_mask took forever to
update as its a per cpu cpumask variable. Instead cpu_cpu_mask was a per
NODE /per DIE cpumask that was shared by all the respective CPUs.

Also cpu_cpu_mask is needed from a scheduler perspective. However
cpu_core_map is an exported symbol. Hence stop updating cpu_core_map
and make it point to cpu_cpu_mask.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200921095653.9701-3-srikar@linux.vnet.ibm.com


# 39f87561 18-Aug-2020 Michael Ellerman <mpe@ellerman.id.au>

powerpc/smp: Move ppc_md.cpu_die() to smp_ops.cpu_offline_self()

We have smp_ops->cpu_die() and ppc_md.cpu_die(). One of them offlines
the current CPU and one offlines another CPU, can you guess which is
which? Also one is in smp_ops and one is in ppc_md?

So rename ppc_md.cpu_die(), to cpu_offline_self(), because that's what
it does. And move it into smp_ops where it belongs.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200819015634.1974478-3-mpe@ellerman.id.au


# bf3c1464 18-Aug-2020 Michael Ellerman <mpe@ellerman.id.au>

powerpc/smp: Fold cpu_die() into its only caller

Avoid the eternal confusion between cpu_die() and __cpu_die() by
removing the former, folding it into its only caller.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200819015634.1974478-2-mpe@ellerman.id.au


# 1ea21ba2 18-Aug-2020 Michael Ellerman <mpe@ellerman.id.au>

powerpc: Move arch_cpu_idle_dead() into smp.c

arch_cpu_idle_dead() is in idle.c, which makes sense, but it's inside
a CONFIG_HOTPLUG_CPU block.

It would be more at home in smp.c, inside the existing
CONFIG_HOTPLUG_CPU block. Note that CONFIG_HOTPLUG_CPU depends on
CONFIG_SMP so even though smp.c is not built for SMP=n builds, that's
fine.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200819015634.1974478-1-mpe@ellerman.id.au


# 72730bfc 09-Aug-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Create coregroup domain

Add percpu coregroup maps and masks to create coregroup domain.
If a coregroup doesn't exist, the coregroup domain will be degenerated
in favour of SMT/CACHE domain. Do note this patch is only creating stubs
for cpu_to_coregroup_id. The actual cpu_to_coregroup_id implementation
would be in a subsequent patch.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-10-srikar@linux.vnet.ibm.com


# 6e086302 09-Aug-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Allocate cpumask only after searching thread group

If allocated earlier and the search fails, then cpu_l1_cache_map cpumask
is unnecessarily cleared. However cpu_l1_cache_map can be allocated /
cleared after we search thread group.

Please note CONFIG_CPUMASK_OFFSTACK is not set on Powerpc. Hence cpumask
allocated by zalloc_cpumask_var_node is never freed.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-9-srikar@linux.vnet.ibm.com


# f9f130ff 09-Aug-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Detect support for coregroup

Add support for grouping cores based on the device-tree classification.
- The last domain in the associativity domains always refers to the
core.
- If primary reference domain happens to be the penultimate domain in
the associativity domains device-tree property, then there are no
coregroups. However if its not a penultimate domain, then there are
coregroups. There can be more than one coregroup. For now we would be
interested in the last or the smallest coregroups, i.e one sub-group
per DIE.

Currently there are no firmwares that are exposing this grouping. Hence
allow the basis for grouping to be abstract. Once the firmware starts
using this grouping, code would be added to detect the type of grouping
and adjust the sd domain flags accordingly.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-8-srikar@linux.vnet.ibm.com


# caa8e29d 09-Aug-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Optimize start_secondary

In start_secondary, even if shared_cache was already set, system does a
redundant match for cpumask. This redundant check can be removed by
checking if shared_cache is already set.

While here, localize the sibling_mask variable to within the if
condition.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-7-srikar@linux.vnet.ibm.com


# f6606cfd 13-Sep-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Dont assume l2-cache to be superset of sibling

Current code assumes that cpumask of cpus sharing a l2-cache mask will
always be a superset of cpu_sibling_mask.

Lets stop that assumption. cpu_l2_cache_mask is a superset of
cpu_sibling_mask if and only if shared_caches is set.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200913171038.GB11808@linux.vnet.ibm.com


# 3c6032a8 09-Aug-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Move topology fixups into a new function

Move topology fixup based on the platform attributes into its own
function which is called just before set_sched_topology.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-5-srikar@linux.vnet.ibm.com


# 5e93f16a 09-Aug-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Move powerpc_topology above

Just moving the powerpc_topology description above.
This will help in using functions in this file and avoid declarations.

No other functional changes

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-4-srikar@linux.vnet.ibm.com


# 2ef0ca54 09-Aug-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Merge Power9 topology with Power topology

A new sched_domain_topology_level was added just for Power9. However the
same can be achieved by merging powerpc_topology with power9_topology
and makes the code more simpler especially when adding a new sched
domain.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-3-srikar@linux.vnet.ibm.com


# d0fd24bb 09-Aug-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES

Fix a build warning in a non CONFIG_NEED_MULTIPLE_NODES
"error: _numa_cpu_lookup_table_ undeclared"

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-2-srikar@linux.vnet.ibm.com


# e0d8e991 08-Jul-2020 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/book3s64/kuap: Move UAMOR setup to key init function

UAMOR values are not application-specific. The kernel initializes
its value based on different reserved keys. Remove the thread-specific
UAMOR value and don't switch the UAMOR on context switch.

Move UAMOR initialization to key initialization code and remove
thread_struct.uamor because it is not used anymore.

Before commit: 4a4a5e5d2aad ("powerpc/pkeys: key allocation/deallocation must not change pkey registers")
we used to update uamor based on key allocation and free.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709032946.881753-20-aneesh.kumar@linux.ibm.com


# 65fddcfc 08-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: reorder includes after introduction of linux/pgtable.h

The replacement of <asm/pgrable.h> with <linux/pgtable.h> made the include
of the latter in the middle of asm includes. Fix this up with the aid of
the below script and manual adjustments here and there.

import sys
import re

if len(sys.argv) is not 3:
print "USAGE: %s <file> <header>" % (sys.argv[0])
sys.exit(1)

hdr_to_move="#include <linux/%s>" % sys.argv[2]
moved = False
in_hdrs = False

with open(sys.argv[1], "r") as f:
lines = f.readlines()
for _line in lines:
line = _line.rstrip('
')
if line == hdr_to_move:
continue
if line.startswith("#include <linux/"):
in_hdrs = True
elif not moved and in_hdrs:
moved = True
print hdr_to_move
print line

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-4-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ca5999fd 08-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: introduce include/linux/pgtable.h

The include/linux/pgtable.h is going to be the home of generic page table
manipulation functions.

Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
make the latter include asm/pgtable.h.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 82a7cebd 28-May-2020 Michael Neuling <mikey@neuling.org>

powerpc: Fix misleading small cores print

Currently when we boot on a big core system, we get this print:
[ 0.040500] Using small cores at SMT level

This is misleading as we've actually detected big cores.

This patch clears up the print to say we've detect big cores but are
using small cores for scheduling.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200528230731.1235752-1-mikey@neuling.org


# c72e8da0 13-Mar-2020 Michael Ellerman <mpe@ellerman.id.au>

powerpc/smp: Use IS_ENABLED() to avoid #ifdef

We can avoid the #ifdef by using IS_ENABLED() in the existing
condition check.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200313112020.28235-2-mpe@ellerman.id.au


# 4b4d181d 13-Mar-2020 Michael Ellerman <mpe@ellerman.id.au>

powerpc/smp: Drop superfluous NULL check

We don't need the NULL check of np, the result is the same because the
OF helpers cope with NULL, of_node_to_nid(NULL) == NUMA_NO_NODE (-1).

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200313112020.28235-1-mpe@ellerman.id.au


# 247257b0 29-Jan-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Remove late request for home node associativity

With commit ("powerpc/numa: Early request for home node associativity"),
commit 2ea626306810 ("powerpc/topology: Get topology for shared
processors at boot") which was requesting home node associativity
becomes redundant.

Hence remove the late request for home node associativity.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200129135301.24739-6-srikar@linux.vnet.ibm.com


# a05f0e5b 29-Jan-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Use nid as fallback for package_id

package_id is to match cores that are part of the same chip. On
PowerNV machines, package_id defaults to chip_id. However ibm,chip_id
property is not present in device-tree of PowerVM LPARs. Hence lscpu
output shows one core per socket and multiple cores.

To overcome this, use nid as the package_id on PowerVM LPARs.

Before the patch:

Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 8
Core(s) per socket: 1 <----------------------
Socket(s): 16 <----------------------
NUMA node(s): 2
Model: 2.2 (pvr 004e 0202)
Model name: POWER9 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-63
NUMA node1 CPU(s): 64-127
#
# cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
-1

After the patch:

Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 8 <---------------------
Core(s) per socket: 8 <---------------------
Socket(s): 2
NUMA node(s): 2
Model: 2.2 (pvr 004e 0202)
Model name: POWER9 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-63
NUMA node1 CPU(s): 64-127
#
# cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
0

Now lscpu output is more in line with the system configuration.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[mpe: Use pkg_id instead of ppid, tweak change log and comment]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200129135121.24617-1-srikar@linux.vnet.ibm.com


# 2874c5fd 27-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 7c19c2e5 31-Jan-2019 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc: 'current_set' is now a table of task_struct pointers

The table of pointers 'current_set' has been used for retrieving
the stack and current. They used to be thread_info pointers as
they were pointing to the stack and current was taken from the
'task' field of the thread_info.

Now, the pointers of 'current_set' table are now both pointers
to task_struct and pointers to thread_info.

As they are used to get current, and the stack pointer is
retrieved from current's stack field, this patch changes
their type to task_struct, and renames secondary_ti to
secondary_current.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# ed1cd6de 31-Jan-2019 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc: Activate CONFIG_THREAD_INFO_IN_TASK

This patch activates CONFIG_THREAD_INFO_IN_TASK which
moves the thread_info into task_struct.

Moving thread_info into task_struct has the following advantages:
- It protects thread_info from corruption in the case of stack
overflows.
- Its address is harder to determine if stack addresses are leaked,
making a number of attacks more difficult.

This has the following consequences:
- thread_info is now located at the beginning of task_struct.
- The 'cpu' field is now in task_struct, and only exists when
CONFIG_SMP is active.
- thread_info doesn't have anymore the 'task' field.

This patch:
- Removes all recopy of thread_info struct when the stack changes.
- Changes the CURRENT_THREAD_INFO() macro to point to current.
- Selects CONFIG_THREAD_INFO_IN_TASK.
- Modifies raw_smp_processor_id() to get ->cpu from current without
including linux/sched.h to avoid circular inclusion and without
including asm/asm-offsets.h to avoid symbol names duplication
between ASM constants and C constants.
- Modifies klp_init_thread_info() to take a task_struct pointer
argument.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add task_stack.h to livepatch.h to fix build fails]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 678c668a 17-Jan-2019 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc/64: Use task_stack_page() to initialise paca->kstack

Rather than using the thread info use task_stack_page() to initialise
paca->kstack, that way it will work with THREAD_INFO_IN_TASK.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 6fe243fe 25-Nov-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc/smp: Make __smp_send_nmi_ipi() static

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 88b9a3d1 25-Nov-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc/smp: Fix NMI IPI xmon timeout

The xmon debugger IPI handler waits in the callback function while
xmon is still active. This means they don't complete the IPI, and the
initiator always times out waiting for them.

Things manage to work after the timeout because there is some fallback
logic to keep NMI IPI state sane in case of the timeout, but this is a
bit ugly.

This patch changes NMI IPI back to half-asynchronous (i.e., wait for
everyone to call in, do not wait for IPI function to complete), but
the complexity is avoided by going one step further and allowing new
IPIs to be issued before the IPI functions to all complete.

If synchronization against that is required, it is left up to the
caller, but current callers don't require that. In fact with the
timeout handling, callers must be able to cope with this already.

Fixes: 5b73151fff63 ("powerpc: NMI IPI make NMI IPIs fully sychronous")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 1b5fc84a 25-Nov-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc/smp: Fix NMI IPI timeout

The NMI IPI timeout logic is broken, if __smp_send_nmi_ipi() times out
on the first condition, delay_us will be zero which will send it into
the second spin loop with no timeout so it will spin forever.

Fixes: 5b73151fff63 ("powerpc: NMI IPI make NMI IPIs fully sychronous")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# b6aeddea 18-Oct-2018 Michael Ellerman <mpe@ellerman.id.au>

powerpc: Fix stack protector crashes on CPU hotplug

Recently in commit 7241d26e8175 ("powerpc/64: properly initialise
the stackprotector canary on SMP.") we fixed a crash with stack
protector on SMP by initialising the stack canary in
cpu_idle_thread_init().

But this can also causes crashes, when a CPU comes back online after
being offline:

Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: pnv_smp_cpu_kill_self+0x2a0/0x2b0
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.0-rc3-gcc-7.3.1-00168-g4ffe713b7587 #94
Call Trace:
dump_stack+0xb0/0xf4 (unreliable)
panic+0x144/0x328
__stack_chk_fail+0x2c/0x30
pnv_smp_cpu_kill_self+0x2a0/0x2b0
cpu_die+0x48/0x70
arch_cpu_idle_dead+0x20/0x40
do_idle+0x274/0x390
cpu_startup_entry+0x38/0x50
start_secondary+0x5e4/0x600
start_secondary_prolog+0x10/0x14

Looking at the stack we see that the canary value in the stack frame
doesn't match the canary in the task/paca. That is because we have
reinitialised the task/paca value, but then the CPU coming online has
returned into a function using the old canary value. That causes the
comparison to fail.

Instead we can call boot_init_stack_canary() from start_secondary()
which never returns. This is essentially what the generic code does in
cpu_startup_entry() under #ifdef X86, we should make that non-x86
specific in a future patch.

Fixes: 7241d26e8175 ("powerpc/64: properly initialise the stackprotector canary on SMP.")
Reported-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>


# 8e8a31d7 10-Oct-2018 Gautham R. Shenoy <ego@linux.vnet.ibm.com>

powerpc: Use cpu_smallcore_sibling_mask at SMT level on bigcores

POWER9 SMT8 cores consist of two groups of threads, where threads in
each group shares L1-cache. The scheduler is not aware of this
distinction as the current sched-domain hierarchy has all the threads
of the core defined at the SMT domain.

SMT [Thread siblings of the SMT8 core]
DIE [CPUs in the same die]
NUMA [All the CPUs in the system]

Due to this, we can observe run-to-run variance when we run a
multi-threaded benchmark bound to a single core based on how the
scheduler spreads the software threads across the two groups in the
core.

We fix this in this patch by defining each group of threads which
share L1-cache to be the SMT level. The group of threads in the SMT8
core is defined to be the CACHE level. The sched-domain hierarchy
after this patch will be :

SMT [Thread siblings in the core that share L1 cache]
CACHE [Thread siblings that are in the SMT8 core]
DIE [CPUs in the same die]
NUMA [All the CPUs in the system]

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 425752c6 10-Oct-2018 Gautham R. Shenoy <ego@linux.vnet.ibm.com>

powerpc: Detect the presence of big-cores via "ibm, thread-groups"

On IBM POWER9, the device tree exposes a property array identifed by
"ibm,thread-groups" which will indicate which groups of threads share
a particular set of resources.

As of today we only have one form of grouping identifying the group of
threads in the core that share the L1 cache, translation cache and
instruction data flow.

This patch adds helper functions to parse the contents of
"ibm,thread-groups" and populate a per-cpu variable to cache
information about siblings of each CPU that share the L1, traslation
cache and instruction data-flow.

It also defines a new global variable named "has_big_cores" which
indicates if the cores on this configuration have multiple groups of
threads that share L1 cache.

For each online CPU, it maintains a cpu_smallcore_mask, which
indicates the online siblings which share the L1-cache with it.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 7241d26e 13-Oct-2018 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc/64: properly initialise the stackprotector canary on SMP.

commit 06ec27aea9fc ("powerpc/64: add stack protector support")
doesn't initialise the stack canary on SMP secondary CPU's paca,
leading to the following false positive report from the
stack protector.

smp: Bringing up secondary CPUs ...
Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: __schedule+0x978/0xa80
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.0-rc7-next-20181010-autotest-autotest #1
Call Trace:
[c000001fed5b3bf0] [c000000000a0ef3c] dump_stack+0xb0/0xf4 (unreliable)
[c000001fed5b3c30] [c0000000000f9d68] panic+0x140/0x308
[c000001fed5b3cc0] [c0000000000f9844] __stack_chk_fail+0x24/0x30
[c000001fed5b3d20] [c000000000a2c3a8] __schedule+0x978/0xa80
[c000001fed5b3e00] [c000000000a2c9b4] schedule_idle+0x34/0x60
[c000001fed5b3e30] [c00000000013d344] do_idle+0x224/0x3d0
[c000001fed5b3ec0] [c00000000013d6e0] cpu_startup_entry+0x30/0x50
[c000001fed5b3ef0] [c000000000047f34] start_secondary+0x4d4/0x520
[c000001fed5b3f90] [c00000000000b370] start_secondary_prolog+0x10/0x14

This patch properly initialises the stack_canary of the secondary
idle tasks.

Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Fixes: 06ec27aea9fc ("powerpc/64: add stack protector support")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 2ea62630 17-Aug-2018 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/topology: Get topology for shared processors at boot

On a shared LPAR, Phyp will not update the CPU associativity at boot
time. Just after the boot system does recognize itself as a shared
LPAR and trigger a request for correct CPU associativity. But by then
the scheduler would have already created/destroyed its sched domains.

This causes
- Broken load balance across Nodes causing islands of cores.
- Performance degradation esp if the system is lightly loaded
- dmesg to wrongly report all CPUs to be in Node 0.
- Messages in dmesg saying borken topology.
- With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity
node sched domain"), can cause rcu stalls at boot up.

The sched_domains_numa_masks table which is used to generate cpumasks
is only created at boot time just before creating sched domains and
never updated. Hence, its better to get the topology correct before
the sched domains are created.

For example on 64 core Power 8 shared LPAR, dmesg reports

Brought up 512 CPUs
Node 0 CPUs: 0-511
Node 1 CPUs:
Node 2 CPUs:
Node 3 CPUs:
Node 4 CPUs:
Node 5 CPUs:
Node 6 CPUs:
Node 7 CPUs:
Node 8 CPUs:
Node 9 CPUs:
Node 10 CPUs:
Node 11 CPUs:
...
BUG: arch topology borken
the DIE domain not a subset of the NUMA domain
BUG: arch topology borken
the DIE domain not a subset of the NUMA domain

numactl/lscpu output will still be correct with cores spreading across
all nodes:

Socket(s): 64
NUMA node(s): 12
Model: 2.0 (pvr 004d 0200)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s): 152-159,248-255,344-351,440-447
NUMA node11 CPU(s): 160-167,256-263,352-359,448-455

Currently on this LPAR, the scheduler detects 2 levels of Numa and
created numa sched domains for all CPUs, but it finds a single DIE
domain consisting of all CPUs. Hence it deletes all numa sched
domains.

To address this, detect the shared processor and update topology soon
after CPUs are setup so that correct topology is updated just before
scheduler creates sched domain.

With the fix, dmesg reports:

numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 368-375 464-471
numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 376-383 472-479
numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 384-391 480-487
numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 392-399 488-495
numa: Node 4 CPUs: 208-215 304-311 400-407 496-503
numa: Node 5 CPUs: 168-175 264-271 360-367 456-463
numa: Node 6 CPUs: 128-135 224-231 320-327 416-423
numa: Node 7 CPUs: 136-143 232-239 328-335 424-431
numa: Node 8 CPUs: 216-223 312-319 408-415 504-511
numa: Node 9 CPUs: 144-151 240-247 336-343 432-439
numa: Node 10 CPUs: 152-159 248-255 344-351 440-447
numa: Node 11 CPUs: 160-167 256-263 352-359 448-455

and lscpu also reports:

Socket(s): 64
NUMA node(s): 12
Model: 2.0 (pvr 004d 0200)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s): 152-159,248-255,344-351,440-447
NUMA node11 CPU(s): 160-167,256-263,352-359,448-455

Reported-by: Manjunatha H R <manjuhr1@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[mpe: Trim / format change log]
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 5b73151f 24-Apr-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc: NMI IPI make NMI IPIs fully sychronous

There is an asynchronous aspect to smp_send_nmi_ipi. The caller waits
for all CPUs to call in to the handler, but it does not wait for
completion of the handler. This is a needless complication, so remove
it and always wait synchronously.

The synchronous wait allows the caller to easily time out and clear
the wait for completion (zero nmi_ipi_busy_count) in the case of badly
behaved handlers. This would have prevented the recent smp_send_stop
NMI IPI bug from causing the system to hang.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# de6e5d38 18-May-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc: smp_send_stop do not offline stopped CPUs

Marking CPUs stopped by smp_send_stop as offline can cause warnings
due to cross-CPU wakeups. This trace was noticed on a busy system
running a sysrq+c crash test, after the injected crash:

WARNING: CPU: 51 PID: 1546 at kernel/sched/core.c:1179 set_task_cpu+0x22c/0x240
CPU: 51 PID: 1546 Comm: kworker/u352:1 Tainted: G D
Workqueue: mlx5e mlx5e_update_stats_work [mlx5_core]
[...]
NIP [c00000000017c21c] set_task_cpu+0x22c/0x240
LR [c00000000017d580] try_to_wake_up+0x230/0x720
Call Trace:
[c000000001017700] runqueues+0x0/0xb00 (unreliable)
[c00000000017d580] try_to_wake_up+0x230/0x720
[c00000000015a214] insert_work+0x104/0x140
[c00000000015adb0] __queue_work+0x230/0x690
[c000003fc5007910] [c00000000015b26c] queue_work_on+0x5c/0x90
[c0080000135fc8f8] mlx5_cmd_exec+0x538/0xcb0 [mlx5_core]
[c008000013608fd0] mlx5_core_access_reg+0x140/0x1d0 [mlx5_core]
[c00800001362777c] mlx5e_update_pport_counters.constprop.59+0x6c/0x90 [mlx5_core]
[c008000013628868] mlx5e_update_ndo_stats+0x28/0x90 [mlx5_core]
[c008000013625558] mlx5e_update_stats_work+0x68/0xb0 [mlx5_core]
[c00000000015bcec] process_one_work+0x1bc/0x5f0
[c00000000015ecac] worker_thread+0xac/0x6b0
[c000000000168338] kthread+0x168/0x1b0
[c00000000000b628] ret_from_kernel_thread+0x5c/0xb4

This happens because firstly the CPU is not really offline in the
usual sense, processes and interrupts have not been migrated away.
Secondly smp_send_stop does not happen atomically on all CPUs, so
one CPU can have marked itself offline, while another CPU is still
running processes or interrupts which can affect the first CPU.

Fix this by just not marking the CPU as offline. It's more like
frozen in time, so offline does not really reflect its state properly
anyway. There should be nothing in the crash/panic path that walks
online CPUs and synchronously waits for them, so this change should
not introduce new hangs.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 6ba55716 02-May-2018 Michael Ellerman <mpe@ellerman.id.au>

powerpc/nmi: Add an API for sending "safe" NMIs

Currently the options we have for sending NMIs are not necessarily
safe, that is they can potentially interrupt a CPU in a
non-recoverable region of code, meaning the kernel must then panic().

But we'd like to use smp_send_nmi_ipi() to do cross-CPU calls in
situations where we don't want to risk a panic(), because it doesn't
have the requirement that interrupts must be enabled like
smp_call_function().

So add an API for the caller to indicate that it wants to use the NMI
infrastructure, but doesn't want to do anything "unsafe".

Currently that is implemented by not actually calling cause_nmi_ipi(),
instead falling back to an IPI. In future we can pass the safe
parameter down to cause_nmi_ipi() and the individual backends can
potentially take it into account before deciding what to do.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>


# 21bfd6a8 04-May-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc: move a stray NMI IPI case under NMI_IPI ifdef

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# bc907113 04-May-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc: move timer broadcast code under GENERIC_CLOCKEVENTS_BROADCAST ifdef

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 3f984620 04-May-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc: generic clockevents broadcast receiver call tick_receive_broadcast

The broadcast tick recipient can call tick_receive_broadcast rather
than re-running the full timer interrupt.

It does not have to check for the next event time, because the sender
already determined the timer has expired. It does not have to test
irq_work_pending, because that's a direct decrementer interrupt and
does not go through the clock events subsystem. And it does not have
to read PURR because that was removed with the previous patch.

This results in no code size change, but both the decrementer and
broadcast path lengths are reduced.

Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 424ef016 18-Apr-2018 Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>

powerpc64/ftrace: Disable ftrace during hotplug

Disable ftrace when a cpu is about to go offline. When the cpu is woken
up, ftrace will get enabled in start_secondary().

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# d1039786 18-Apr-2018 Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>

powerpc64/ftrace: Delay enabling ftrace on secondary cpus

On the boot cpu, though we enable paca->ftrace_enabled in early_setup()
(via cpu_ready_for_interrupts()), we don't start tracing until much
later since ftrace is not initialized yet and since we only support
DYNAMIC_FTRACE on powerpc. However, it is possible that ftrace has been
initialized by the time some of the secondary cpus start up. In this
case, we will try to trace some of the early boot code which can cause
problems.

To address this, move setting paca->ftrace_enabled from
cpu_ready_for_interrupts() to early_setup() for the boot cpu, and towards
the end of start_secondary() for secondary cpus.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 6029755e 26-Apr-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc: Fix deadlock with multiple calls to smp_send_stop

smp_send_stop can lock up the IPI path for any subsequent calls,
because the receiving CPUs spin in their handler function. This
started becoming a problem with the addition of an smp_send_stop
call in the reboot path, because panics can reboot after doing
their own smp_send_stop.

The NMI IPI variant was fixed with ac61c11566 ("powerpc: Fix
smp_send_stop NMI IPI handling"), which leaves the smp_call_function
variant.

This is fixed by having smp_send_stop only ever do the
smp_call_function once. This is a bit less robust than the NMI IPI
fix, because any other call to smp_call_function after smp_send_stop
could deadlock, but that has always been the case, and it was not
been a problem before.

Fixes: f2748bdfe1573 ("powerpc/powernv: Always stop secondaries before reboot/shutdown")
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# ac61c115 24-Apr-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc: Fix smp_send_stop NMI IPI handling

The NMI IPI handler for a receiving CPU increments nmi_ipi_busy_count
over the handler function call, which causes later smp_send_nmi_ipi()
callers to spin until the call is finished.

The stop_this_cpu() function never returns, so the busy count is never
decremeted, which can cause the system to hang in some cases. For
example panic() will call smp_send_stop() early on which calls
stop_this_cpu() on other CPUs, then later in the reboot path,
pnv_restart() will call smp_send_stop() again, which hangs.

Fix this by adding a special case to the stop_this_cpu() handler to
decrement the busy count, because it will never return.

Now that the NMI/non-NMI versions of stop_this_cpu() are different,
split them out into separate functions rather than doing #ifdef tricks
to share the body between the two functions.

Fixes: 6bed3237624e3 ("powerpc: use NMI IPI for smp_send_stop")
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out the functions, tweak change log a bit]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 855bfe0d 01-Apr-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc: hard disable irqs in smp_send_stop loop

The hard lockup watchdog can fire under local_irq_disable
on platforms with irq soft masking.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 6bed3237 01-Apr-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc: use NMI IPI for smp_send_stop

Use the NMI IPI rather than smp_call_function for smp_send_stop.
Have stopped CPUs hard disable interrupts rather than just soft
disable.

This function is used in crash/panic/shutdown paths to bring other
CPUs down as quickly and reliably as possible, and minimizing their
potential to cause trouble.

Avoiding the Linux smp_call_function infrastructure and (if supported)
using true NMI IPIs makes this more robust.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# d2e60075 13-Feb-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc/64: Use array of paca pointers and allocate pacas individually

Change the paca array into an array of pointers to pacas. Allocate
pacas individually.

This allows flexibility in where the PACAs are allocated. Future work
will allocate them node-local. Platforms that don't have address limits
on PACAs would be able to defer PACA allocations until later in boot
rather than allocate all possible ones up-front then freeing unused.

This is slightly more overhead (one additional indirection) for cross
CPU paca references, but those aren't too common.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 4145f358 15-Dec-2017 Balbir Singh <bsingharora@gmail.com>

powernv/kdump: Fix cases where the kdump kernel can get HMI's

Certain HMI's such as malfunction error propagate through
all threads/core on the system. If a thread was offline
prior to us crashing the system and jumping to the kdump
kernel, bad things happen when it wakes up due to an HMI
in the kdump kernel.

There are several possible ways to solve this problem

1. Put the offline cores in a state such that they are
not woken up for machine check and HMI errors. This
does not work, since we might need to wake up offline
threads to handle TB errors
2. Ignore HMI errors, setup HMEER to mask HMI errors,
but this still leads the window open for any MCEs
and masking them for the duration of the dump might
be a concern
3. Wake up offline CPUs, as in send them to
crash_ipi_callback (not wake them up as in mark them
online as seen by the hotplug). kexec does a
wake_online_cpus() call, this patch does something
similar, but instead sends an IPI and forces them to
crash_ipi_callback()

This patch takes approach #3.

Care is taken to enable this only for powenv platforms
via crash_wake_offline (a global value set at setup
time). The crash code sends out IPI's to all CPU's
which then move to crash_ipi_callback and kexec_smp_wait().

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 96d91431 29-Jun-2017 Oliver O'Halloran <oohall@gmail.com>

powerpc/smp: Add Power9 scheduler topology

In previous generations of Power processors each core had a private L2
cache. The Power 9 processor has a slightly different design where the
L2 cache is shared among pairs of cores rather than being completely
private.

Making the scheduler aware of this cache sharing allows the scheduler to
make better migration decisions. For example, if two CPU heavy tasks
share a core then one task can be migrated to the paired core to improve
throughput. Under the existing three level topology the task could be
migrated to any core on the same chip, while with the new topology it
would be preferentially migrated to the paired core so it remains
cache-hot.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 2a636a56 29-Jun-2017 Oliver O'Halloran <oohall@gmail.com>

powerpc/smp: Add cpu_l2_cache_map

We want to add an extra level to the CPU scheduler topology to account
for cores which share a cache. To do this we need to build a cpumask
for each CPU that indicates which CPUs share this cache to use as an
input to the scheduler.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# df52f671 29-Jun-2017 Oliver O'Halloran <oohall@gmail.com>

powerpc/smp: Rework CPU topology construction

The CPU scheduler topology is constructed from a number of per-cpu
cpumasks which describe which sets of logical CPUs are related in some
fashion. Current code that handles constructing these masks when CPUs
are hot(un)plugged can be simplified a bit by exploiting the fact that
the scheduler requires higher levels of the toplogy (e.g package level
groupings) to be supersets of the lower levels (e.g. threas in a core).
This patch reworks the cpumask construction to be simpler and easier to
extend with extra topology levels.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
[mpe: Fix CONFIG_HOTPLUG_CPU=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# e3d8b67e 29-Jun-2017 Oliver O'Halloran <oohall@gmail.com>

powerpc/smp: Use cpu_to_chip_id() to find core siblings

When building the CPU scheduler topology the kernel uses the ibm,chipid
property from the devicetree to group logical CPUs. Currently the DT
search for this property is open-coded in smp.c and this functionality
is a duplication of what's in cpu_to_chip_id() already. This patch
removes the existing search in favor of that.

It's worth mentioning that the semantics of the search are different
in cpu_to_chip_id(). When there is no ibm,chipid in the CPUs node it
will also search /cpus and / for the property, but this should not
effect the output topology.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 0459ddfd 09-Aug-2017 Nicholas Piggin <npiggin@gmail.com>

powerpc: NMI IPI improve lock primitive

When the NMI IPI lock is contended, spin at low SMT priority, using
loads only, and with interrupts enabled (where possible). This
improves behaviour under high contention (e.g., a system crash when
a number of CPUs are trying to enter the debugger).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 7b7622bb 27-Jul-2017 Michael Ellerman <mpe@ellerman.id.au>

powerpc/smp: Call smp_ops->setup_cpu() directly on the boot CPU

In smp_cpus_done() we need to call smp_ops->setup_cpu() for the boot
CPU, which means it has to run *on* the boot CPU.

In the past we ensured it ran on the boot CPU by changing the CPU
affinity mask of current directly. That was removed in commit
6d11b87d55eb ("powerpc/smp: Replace open coded task affinity logic"),
and replaced with a work queue call.

Unfortunately using a work queue leads to a lockdep warning, now that
the CPU hotplug lock is a regular semaphore:

======================================================
WARNING: possible circular locking dependency detected
...
kworker/0:1/971 is trying to acquire lock:
(cpu_hotplug_lock.rw_sem){++++++}, at: [<c000000000100974>] apply_workqueue_attrs+0x34/0xa0

but task is already holding lock:
((&wfc.work)){+.+.+.}, at: [<c0000000000fdb2c>] process_one_work+0x25c/0x800
...
CPU0 CPU1
---- ----
lock((&wfc.work));
lock(cpu_hotplug_lock.rw_sem);
lock((&wfc.work));
lock(cpu_hotplug_lock.rw_sem);

Although the deadlock can't happen in practice, because
smp_cpus_done() only runs in early boot before CPU hotplug is allowed,
lockdep can't tell that.

Luckily in commit 8fb12156b8db ("init: Pin init task to the boot CPU,
initially") tglx changed the generic code to pin init to the boot CPU
to begin with. The unpinning of init from the boot CPU happens in
sched_init_smp(), which is called after smp_cpus_done().

So smp_cpus_done() is always called on the boot CPU, which means we
don't need the work queue call at all - and the lockdep warning goes
away.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>


# 2104180a 12-Jul-2017 Nicholas Piggin <npiggin@gmail.com>

powerpc/64s: implement arch-specific hardlockup watchdog

Implement an arch-speicfic watchdog rather than use the perf-based
hardlockup detector.

The new watchdog takes the soft-NMI directly, rather than going through
perf. Perf interrupts are to be made maskable in future, so that would
prevent the perf detector from working in those regions.

Additionally, implement a SMP based detector where all CPUs watch one
another by pinging a shared cpumask. This is because powerpc Book3S
does not have a true periodic local NMI, but some platforms do implement
a true NMI IPI.

If a CPU is stuck with interrupts hard disabled, the soft-NMI watchdog
does not work, but the SMP watchdog will. Even on platforms without a
true NMI IPI to get a good trace from the stuck CPU, other CPUs will
notice the lockup sufficiently to report it and panic.

[npiggin@gmail.com: honor watchdog disable at boot/hotplug]
Link: http://lkml.kernel.org/r/20170621001346.5bb337c9@roar.ozlabs.ibm.com
[npiggin@gmail.com: fix false positive warning at CPU unplug]
Link: http://lkml.kernel.org/r/20170630080740.20766-1-npiggin@gmail.com
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170616065715.18390-6-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Don Zickus <dzickus@redhat.com>
Tested-by: Babu Moger <babu.moger@oracle.com> [sparc]
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4e287e65 06-Jun-2017 Nicholas Piggin <npiggin@gmail.com>

powerpc: use spin loop primitives in some functions

Use the different spin loop primitives in some simple powerpc
spin loops, including those which will spin as a common case.

This will help to test the spin loop primitives before more
conversions are done.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add some includes of <linux/processor.h>]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# c642af9c 26-Jun-2017 Santosh Sivaraj <santosh@fossix.org>

powerpc/smp: Convert NR_CPUS to nr_cpu_ids

nr_cpu_ids can be limited by nr_cpus boot parameter, whereas NR_CPUS is a
compile time constant, which shouldn't be compared against during cpu kick.

Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# f8d0d5dc 26-Jun-2017 Santosh Sivaraj <santosh@fossix.org>

powerpc/smp: Do not BUG_ON if invalid CPU during kick

During secondary start, we do not need to BUG_ON if an invalid CPU number
is passed. We already print an error if secondary cannot be started, so
just return an error instead.

Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# a8fcfc19 16-May-2017 Thomas Gleixner <tglx@linutronix.de>

powerpc: Adjust system_state check

To enable smp_processor_id() and might_sleep() debug checks earlier, it's
required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING.

Adjust the system_state check in smp_generic_cpu_bootable() to handle the
extra states.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20170516184735.359536998@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 687b8f24 15-Feb-2017 Michael Ellerman <mpe@ellerman.id.au>

powerpc/smp: Document irq enable/disable after migrating IRQs

This code was until recently completely undocumented and even now the comment is
not very verbose.

We've already had one patch sent to remove the IRQ enable/disable because it's
"paradoxical and unnecessary". So document it thoroughly to save anyone else
from puzzling over it.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# c64af645 19-Dec-2016 Nicholas Piggin <npiggin@gmail.com>

powerpc: Add struct smp_ops_t.cause_nmi_ipi operation

Have the NMI IPI code use this op when the platform defines it.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# ddd703ca 19-Dec-2016 Nicholas Piggin <npiggin@gmail.com>

powerpc: Add NMI IPI infrastructure

Add a simple NMI IPI system that handles concurrency and reentrancy.

The platform does not have to implement a true non-maskable interrupt,
the default is to simply use the debugger break IPI message. This has
now been co-opted for a general IPI message, and users (debugger and
crash) have been reimplemented on top of the NMI system.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Incorporate incremental fixes from Nick]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 6d11b87d 12-Apr-2017 Thomas Gleixner <tglx@linutronix.de>

powerpc/smp: Replace open coded task affinity logic

Init task invokes smp_ops->setup_cpu() from smp_cpus_done(). Init task can
run on any online CPU at this point, but the setup_cpu() callback requires
to be invoked on the boot CPU. This is achieved by temporarily setting the
affinity of the calling user space thread to the requested CPU and reset it
to the original affinity afterwards.

That's racy vs. CPU hotplug and concurrent affinity settings for that
thread resulting in code executing on the wrong CPU and overwriting the
new affinity setting.

That's actually not a problem in this context as neither CPU hotplug nor
affinity settings can happen, but the access to task_struct::cpus_allowed
is about to restricted.

Replace it with a call to work_on_cpu_safe() which achieves the same result.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Len Brown <lenb@kernel.org>
Link: http://lkml.kernel.org/r/20170412201042.518053336@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# b87ac021 13-Apr-2017 Nicholas Piggin <npiggin@gmail.com>

powerpc: Introduce msgsnd/doorbell barrier primitives

POWER9 changes requirements and adds new instructions for
synchronization.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# b866cc21 13-Apr-2017 Nicholas Piggin <npiggin@gmail.com>

powerpc: Change the doorbell IPI calling convention

Change the doorbell callers to know about their msgsnd addressing,
rather than have them set a per-cpu target data tag at boot that gets
sent to the cause_ipi functions. The data is only used for doorbell IPI
functions, no other IPI types, so it makes sense to keep that detail
local to doorbell.

Have the platform code understand doorbell IPIs, rather than the
interrupt controller code understand them. Platform code can look at
capabilities it has available and decide which to use.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# a978e139 05-Apr-2017 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/smp: Remove migrate_irq() custom implementation

Some powerpc platforms use this to move IRQs away from a CPU being
unplugged. This function has several bugs such as not taking the right
locks or failing to NULL check pointers.

There's a new generic function doing exactly the same thing without all
the bugs, so let's use it instead.

mpe: The obvious place for the select of GENERIC_IRQ_MIGRATION is on
HOTPLUG_CPU, but that doesn't work. On some configs PM_SLEEP_SMP will
select HOTPLUG_CPU even though its dependencies are not met, which means
the select of GENERIC_IRQ_MIGRATION doesn't happen. That leads to the
build breaking. Fix it by moving the select of GENERIC_IRQ_MIGRATION to
SMP.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 14d4ae5c 05-Apr-2017 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Add optional smp_ops->prepare_cpu SMP callback

Some platforms (will) need to perform allocations before bringing
a new CPU online. Doing it from smp_ops->setup_cpu is the wrong
thing to do:

- It has no useful failure path (too late)
- Calling any allocator will enable interrupts prematurely
causing problems with large decrementer among others

Instead, add a new callback that is called from __cpu_up (so from
the context trying to online the new CPU) at a point where we
can safely allocate and handle failures.

This will be used by XIVE support.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 68e21be2 01-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Move task->mm handling methods to <linux/sched/mm.h>

Move the following task->mm helper APIs into a new header file,
<linux/sched/mm.h>, to further reduce the size and complexity
of <linux/sched.h>.

Here are how the APIs are used in various kernel files:

# mm_alloc():
arch/arm/mach-rpc/ecard.c
fs/exec.c
include/linux/sched/mm.h
kernel/fork.c

# __mmdrop():
arch/arc/include/asm/mmu_context.h
include/linux/sched/mm.h
kernel/fork.c

# mmdrop():
arch/arm/mach-rpc/ecard.c
arch/m68k/sun3/mmu_emu.c
arch/x86/mm/tlb.c
drivers/gpu/drm/amd/amdkfd/kfd_process.c
drivers/gpu/drm/i915/i915_gem_userptr.c
drivers/infiniband/hw/hfi1/file_ops.c
drivers/vfio/vfio_iommu_spapr_tce.c
fs/exec.c
fs/proc/base.c
fs/proc/task_mmu.c
fs/proc/task_nommu.c
fs/userfaultfd.c
include/linux/mmu_notifier.h
include/linux/sched/mm.h
kernel/fork.c
kernel/futex.c
kernel/sched/core.c
mm/khugepaged.c
mm/ksm.c
mm/mmu_context.c
mm/mmu_notifier.c
mm/oom_kill.c
virt/kvm/kvm_main.c

# mmdrop_async_fn():
include/linux/sched/mm.h

# mmdrop_async():
include/linux/sched/mm.h
kernel/fork.c

# mmget_not_zero():
fs/userfaultfd.c
include/linux/sched/mm.h
mm/oom_kill.c

# mmput():
arch/arc/include/asm/mmu_context.h
arch/arc/kernel/troubleshoot.c
arch/frv/mm/mmu-context.c
arch/powerpc/platforms/cell/spufs/context.c
arch/sparc/include/asm/mmu_context_32.h
drivers/android/binder.c
drivers/gpu/drm/etnaviv/etnaviv_gem.c
drivers/gpu/drm/i915/i915_gem_userptr.c
drivers/infiniband/core/umem.c
drivers/infiniband/core/umem_odp.c
drivers/infiniband/core/uverbs_main.c
drivers/infiniband/hw/mlx4/main.c
drivers/infiniband/hw/mlx5/main.c
drivers/infiniband/hw/usnic/usnic_uiom.c
drivers/iommu/amd_iommu_v2.c
drivers/iommu/intel-svm.c
drivers/lguest/lguest_user.c
drivers/misc/cxl/fault.c
drivers/misc/mic/scif/scif_rma.c
drivers/oprofile/buffer_sync.c
drivers/vfio/vfio_iommu_type1.c
drivers/vhost/vhost.c
drivers/xen/gntdev.c
fs/exec.c
fs/proc/array.c
fs/proc/base.c
fs/proc/task_mmu.c
fs/proc/task_nommu.c
fs/userfaultfd.c
include/linux/sched/mm.h
kernel/cpuset.c
kernel/events/core.c
kernel/events/uprobes.c
kernel/exit.c
kernel/fork.c
kernel/ptrace.c
kernel/sys.c
kernel/trace/trace_output.c
kernel/tsacct.c
mm/memcontrol.c
mm/memory.c
mm/mempolicy.c
mm/migrate.c
mm/mmu_notifier.c
mm/nommu.c
mm/oom_kill.c
mm/process_vm_access.c
mm/rmap.c
mm/swapfile.c
mm/util.c
virt/kvm/async_pf.c

# mmput_async():
include/linux/sched/mm.h
kernel/fork.c
mm/oom_kill.c

# get_task_mm():
arch/arc/kernel/troubleshoot.c
arch/powerpc/platforms/cell/spufs/context.c
drivers/android/binder.c
drivers/gpu/drm/etnaviv/etnaviv_gem.c
drivers/infiniband/core/umem.c
drivers/infiniband/core/umem_odp.c
drivers/infiniband/hw/mlx4/main.c
drivers/infiniband/hw/mlx5/main.c
drivers/infiniband/hw/usnic/usnic_uiom.c
drivers/iommu/amd_iommu_v2.c
drivers/iommu/intel-svm.c
drivers/lguest/lguest_user.c
drivers/misc/cxl/fault.c
drivers/misc/mic/scif/scif_rma.c
drivers/oprofile/buffer_sync.c
drivers/vfio/vfio_iommu_type1.c
drivers/vhost/vhost.c
drivers/xen/gntdev.c
fs/proc/array.c
fs/proc/base.c
fs/proc/task_mmu.c
include/linux/sched/mm.h
kernel/cpuset.c
kernel/events/core.c
kernel/exit.c
kernel/fork.c
kernel/ptrace.c
kernel/sys.c
kernel/trace/trace_output.c
kernel/tsacct.c
mm/memcontrol.c
mm/memory.c
mm/mempolicy.c
mm/migrate.c
mm/mmu_notifier.c
mm/nommu.c
mm/util.c

# mm_access():
fs/proc/base.c
include/linux/sched/mm.h
kernel/fork.c
mm/process_vm_access.c

# mm_release():
arch/arc/include/asm/mmu_context.h
fs/exec.c
include/linux/sched/mm.h
include/uapi/linux/sched.h
kernel/exit.c
kernel/fork.c

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 105ab3d8 01-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/topology.h>

We are going to split <linux/sched/topology.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 0c98d344 05-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/core: Remove the tsk_cpus_allowed() wrapper

So the original intention of tsk_cpus_allowed() was to 'future-proof'
the field - but it's pretty ineffectual at that, because half of
the code uses ->cpus_allowed directly ...

Also, the wrapper makes the code longer than the original expression!

So just get rid of it. This also shrinks <linux/sched.h> a bit.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# f1f10076 27-Feb-2017 Vegard Nossum <vegard.nossum@oracle.com>

mm: add new mmgrab() helper

Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:

git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

This is needed for a later patch that hooks into the helper, but might
be a worthwhile cleanup on its own.

(Michal Hocko provided most of the kerneldoc comment.)

Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# da665885 29-Nov-2016 Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>

powerpc: Change places using CONFIG_KEXEC to use CONFIG_KEXEC_CORE instead.

Commit 2965faa5e03d ("kexec: split kexec_load syscall from kexec core
code") introduced CONFIG_KEXEC_CORE so that CONFIG_KEXEC means whether
the kexec_load system call should be compiled-in and CONFIG_KEXEC_FILE
means whether the kexec_file_load system call should be compiled-in.
These options can be set independently from each other.

Since until now powerpc only supported kexec_load, CONFIG_KEXEC and
CONFIG_KEXEC_CORE were synonyms. That is not the case anymore, so we
need to make a distinction. Almost all places where CONFIG_KEXEC was
being used should be using CONFIG_KEXEC_CORE instead, since
kexec_file_load also needs that code compiled in.

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 19ab58d1 16-Aug-2016 Boqun Feng <boqun.feng@gmail.com>

powerpc, hotplug: Avoid to touch non-existent cpumasks.

We observed a kernel oops when running a PPC guest with config NR_CPUS=4
and qemu option "-smp cores=1,threads=8":

[ 30.634781] Unable to handle kernel paging request for data at
address 0xc00000014192eb17
[ 30.636173] Faulting instruction address: 0xc00000000003e5cc
[ 30.637069] Oops: Kernel access of bad area, sig: 11 [#1]
[ 30.637877] SMP NR_CPUS=4 NUMA pSeries
[ 30.638471] Modules linked in:
[ 30.638949] CPU: 3 PID: 27 Comm: migration/3 Not tainted
4.7.0-07963-g9714b26 #1
[ 30.640059] task: c00000001e29c600 task.stack: c00000001e2a8000
[ 30.640956] NIP: c00000000003e5cc LR: c00000000003e550 CTR:
0000000000000000
[ 30.642001] REGS: c00000001e2ab8e0 TRAP: 0300 Not tainted
(4.7.0-07963-g9714b26)
[ 30.643139] MSR: 8000000102803033 <SF,VEC,VSX,FP,ME,IR,DR,RI,LE,TM[E]> CR: 22004084 XER: 00000000
[ 30.644583] CFAR: c000000000009e98 DAR: c00000014192eb17 DSISR: 40000000 SOFTE: 0
GPR00: c00000000140a6b8 c00000001e2abb60 c0000000016dd300 0000000000000003
GPR04: 0000000000000000 0000000000000004 c0000000016e5920 0000000000000008
GPR08: 0000000000000004 c00000014192eb17 0000000000000000 0000000000000020
GPR12: c00000000140a6c0 c00000000ffffc00 c0000000000d3ea8 c00000001e005680
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 c00000001e6b3a00 0000000000000000 0000000000000001
GPR24: c00000001ff85138 c00000001ff85130 000000001eb6f000 0000000000000001
GPR28: 0000000000000000 c0000000017014e0 0000000000000000 0000000000000018
[ 30.653882] NIP [c00000000003e5cc] __cpu_disable+0xcc/0x190
[ 30.654713] LR [c00000000003e550] __cpu_disable+0x50/0x190
[ 30.655528] Call Trace:
[ 30.655893] [c00000001e2abb60] [c00000000003e550] __cpu_disable+0x50/0x190 (unreliable)
[ 30.657280] [c00000001e2abbb0] [c0000000000aca0c] take_cpu_down+0x5c/0x100
[ 30.658365] [c00000001e2abc10] [c000000000163918] multi_cpu_stop+0x1a8/0x1e0
[ 30.659617] [c00000001e2abc60] [c000000000163cc0] cpu_stopper_thread+0xf0/0x1d0
[ 30.660737] [c00000001e2abd20] [c0000000000d8d70] smpboot_thread_fn+0x290/0x2a0
[ 30.661879] [c00000001e2abd80] [c0000000000d3fa8] kthread+0x108/0x130
[ 30.662876] [c00000001e2abe30] [c000000000009968] ret_from_kernel_thread+0x5c/0x74
[ 30.664017] Instruction dump:
[ 30.664477] 7bde1f24 38a00000 787f1f24 3b600001 39890008 7d204b78 7d05e214 7d0b07b4
[ 30.665642] 796b1f24 7d26582a 7d204a14 7d29f214 <7d4048a8> 7d4a3878 7d4049ad 40c2fff4
[ 30.666854] ---[ end trace 32643b7195717741 ]---

The reason of this is that in __cpu_disable(), when we try to set the
cpu_sibling_mask or cpu_core_mask of the sibling CPUs of the disabled
one, we don't check whether the current configuration employs those
sibling CPUs(hw threads). And if a CPU is not employed by a
configuration, the percpu structures cpu_{sibling,core}_mask are not
allocated, therefore accessing those cpumasks will result in problems as
above.

This patch fixes this problem by adding an addition check on whether the
id is no less than nr_cpu_ids in the sibling CPU iteration code.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# b92a226e 23-Jul-2016 Kevin Hao <haokexin@gmail.com>

powerpc: Move cpu_has_feature() to a separate file

We plan to use jump label for cpu_has_feature(). In order to implement
this we need to include the linux/jump_label.h in asm/cputable.h.

Unfortunately if we do that it leads to an include loop. The root of the
problem seems to be that reg.h needs cputable.h (for CPU_FTRs), and then
cputable.h via jump_label.h eventually pulls in hw_irq.h which needs
reg.h (for MSR_EE).

So move cpu_has_feature() to a separate file on its own.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Rename to cpu_has_feature.h and flesh out change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# f8ab4810 02-Jun-2016 Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>

powerpc: export cpu_to_core_id()

Export cpu_to_core_id(). This will be used by the lpfc driver.

This enables topology_core_id() from <linux/topology.h> (defined
to cpu_to_core_id() in arch/powerpc/include/asm/topology.h) to be
used by (non-builtin) modules.

That is arch-neutral, already used by eg, drivers/base/topology.c,
but it is builtin (obj-y in Makefile) thus didn't need the export.

Since the module uses topology_core_id() and this is defined to
cpu_to_core_id(), it needs the export, otherwise:

ERROR: "cpu_to_core_id" [drivers/scsi/lpfc/lpfc.ko] undefined!

Tested on next-20160601.

Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 665e87ff 17-May-2016 Daniel Axtens <dja@axtens.net>

powerpc/sparse: Include headers containing prototypes

Sometimes headers that provide prototypes for functions are
accidentally omitted from the files that define the functions.

Fix a couple of times that occurs.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 42f5b4ca 17-May-2016 Daniel Axtens <dja@axtens.net>

powerpc: Introduce asm-prototypes.h

Sparse picked up a number of functions that are implemented in C and
then only referred to in asm code.

This introduces asm-prototypes.h, which provides a place for
prototypes of these functions.

This silences some sparse warnings.

Signed-off-by: Daniel Axtens <dja@axtens.net>
[mpe: Add include guards, clean up copyright & GPL text]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# e9d867a6 09-Mar-2016 Peter Zijlstra (Intel) <peterz@infradead.org>

sched: Allow per-cpu kernel threads to run on online && !active

In order to enable symmetric hotplug, we must mirror the online &&
!active state of cpu-down on the cpu-up side.

However, to retain sanity, limit this state to per-cpu kthreads.

Aside from the change to set_cpus_allowed_ptr(), which allow moving
the per-cpu kthreads on, the other critical piece is the cpu selection
for pinned tasks in select_task_rq(). This avoids dropping into
select_fallback_rq().

select_fallback_rq() cannot be allowed to select !active cpus because
its used to migrate user tasks away. And we do not want to move user
tasks onto cpus that are in transition.

Requested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160301152303.GV6356@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 2f4f1f81 20-Nov-2015 chenhui zhao <chenhui.zhao@freescale.com>

powerpc/mpc85xx: Add hotplug support on E5500 and E500MC cores

Freescale E500MC and E5500 core-based platforms, like P4080, T1040,
support disabling/enabling CPU dynamically.
This patch adds this feature on those platforms.

Signed-off-by: Chenhui Zhao <chenhui.zhao@freescale.com>
Signed-off-by: Tang Yuantian <Yuantian.Tang@feescale.com>
[scottwood: removed unused pr_fmt]
Signed-off-by: Scott Wood <oss@buserror.net>


# fc6d73d6 26-Feb-2016 Thomas Gleixner <tglx@linutronix.de>

arch/hotplug: Call into idle with a proper state

Let the non boot cpus call into idle with the corresponding hotplug state, so
the hotplug core can handle the further bringup. That's a first step to
convert the boot side of the hotplugged cpus to do all the synchronization
with the other side through the state machine. For now it'll only start the
hotplug thread and kick the full bringup of the cpu.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.614102639@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# e17769eb 21-Dec-2015 Suresh E. Warrier <warrier@linux.vnet.ibm.com>

KVM: PPC: Book3S HV: Send IPI to host core to wake VCPU

This patch adds support to real-mode KVM to search for a core
running in the host partition and send it an IPI message with
VCPU to be woken. This avoids having to switch to the host
partition to complete an H_IPI hypercall when the VCPU which
is the target of the the H_IPI is not loaded (is not running
in the guest).

The patch also includes the support in the IPI handler running
in the host to do the wakeup by calling kvmppc_xics_ipi_action
for the PPC_MSG_RM_HOST_ACTION message.

When a guest is being destroyed, we need to ensure that there
are no pending IPIs waiting to wake up a VCPU before we free
the VCPUs of the guest. This is accomplished by:
- Forces a PPC_MSG_CALL_FUNCTION IPI to be completed by all CPUs
before freeing any VCPUs in kvm_arch_destroy_vm().
- Any PPC_MSG_RM_HOST_ACTION messages must be executed first
before any other PPC_MSG_CALL_FUNCTION messages.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 31639c77 17-Dec-2015 Suresh Warrier <warrier@linux.vnet.ibm.com>

powerpc/smp: Add smp_muxed_ipi_set_message

smp_muxed_ipi_message_pass() invokes smp_ops->cause_ipi, which
uses an ioremapped address to access registers on the XICS
interrupt controller to cause the IPI. Because of this real
mode callers cannot call smp_muxed_ipi_message_pass() for IPI
messaging.

This patch creates a separate function smp_muxed_ipi_set_message
just to set the IPI message without the cause_ipi routine.
After calling this function to set the IPI message, real
mode callers must cause the IPI by writing to the XICS registers
directly.

As part of this, we also change smp_muxed_ipi_message_pass
to call smp_muxed_ipi_set_message to set the message instead
of doing it directly inside the routine.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# bd7f561f 17-Dec-2015 Suresh Warrier <warrier@linux.vnet.ibm.com>

powerpc/smp: Support more IPI messages

This patch increases the number of demuxed messages for a
controller with a single ipi to 8 for 64-bit systems.

This is required because we want to use the IPI mechanism
to send messages from a CPU running in KVM real mode in a
guest to a CPU in the host to take some action. Currently,
we only support 4 messages and all 4 are already taken.

Define a fifth message PPC_MSG_RM_HOST_ACTION for this
purpose.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 875ebe94 23-Feb-2015 Michael Ellerman <mpe@ellerman.id.au>

powerpc/smp: Wait until secondaries are active & online

Anton has a busy ppc64le KVM box where guests sometimes hit the infamous
"kernel BUG at kernel/smpboot.c:134!" issue during boot:

BUG_ON(td->cpu != smp_processor_id());

Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
output confirms it:

CPU: 0
Comm: watchdog/130

The problem is that we aren't ensuring the CPU active bit is set for the
secondary before allowing the master to continue on. The master unparks
the secondary CPU's kthreads and the scheduler looks for a CPU to run
on. It calls select_task_rq() and realises the suggested CPU is not in
the cpus_allowed mask. It then ends up in select_fallback_rq(), and
since the active bit isnt't set we choose some other CPU to run on.

This seems to have been introduced by 6acbfb96976f "sched: Fix hotplug
vs. set_cpus_allowed_ptr()", which changed from setting active before
online to setting active after online. However that was in turn fixing a
bug where other code assumed an active CPU was also online, so we can't
just revert that fix.

The simplest fix is just to spin waiting for both active & online to be
set. We already have a barrier prior to set_cpu_online() (which also
sets active), to ensure all other setup is completed before online &
active are set.

Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 8aa989b8 26-Jan-2015 Michael Ellerman <mpe@ellerman.id.au>

powerpc: Remove some unused functions

Remove slice_set_psize() which is not used.

It was added in 3a8247cc2c85 "powerpc: Only demote individual slices
rather than whole process" but was never used.

Remove vsx_assist_exception() which is not used.

It was added in ce48b2100785 "powerpc: Add VSX context save/restore,
ptrace and signal support" but was never used.

Remove generic_mach_cpu_die() which is not used.

Its last caller was removed in 375f561a4131 "powerpc/powernv: Always go
into nap mode when CPU is offline".

Remove mpc7448_hpc2_power_off() and mpc7448_hpc2_halt() which are
unused.

These were introduced in c5d56332fd6c "[POWERPC] Add general support for
mpc7448hpc2 (Taiga) platform" but were never used.

This was partially found by using a static code analysis program called
cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
[mpe: Update changelog with details on when/why they are unused]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 1be6f10f 28-Dec-2014 Michael Ellerman <mpe@ellerman.id.au>

Revert "powerpc: Secondary CPUs must set cpu_callin_map after setting active and online"

This reverts commit 7c5c92ed56d932b2c19c3f8aea86369509407d33.

Although this did fix the bug it was aimed at, it also broke secondary
startup on platforms that use give/take_timebase(). Unfortunately we
didn't detect that while it was in next.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 7c5c92ed 08-Dec-2014 Anton Blanchard <anton@samba.org>

powerpc: Secondary CPUs must set cpu_callin_map after setting active and online

I have a busy ppc64le KVM box where guests sometimes hit the infamous
"kernel BUG at kernel/smpboot.c:134!" issue during boot:

BUG_ON(td->cpu != smp_processor_id());

Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
output confirms it:

CPU: 0
Comm: watchdog/130

The problem is that we aren't ensuring the CPU active and online bits are set
before allowing the master to continue on. The master unparks the secondary
CPUs kthreads and the scheduler looks for a CPU to run on. It calls
select_task_rq and realises the suggested CPU is not in the cpus_allowed
mask. It then ends up in select_fallback_rq, and since the active and
online bits aren't set we choose some other CPU to run on.

Cc: stable@vger.kernel.org
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 69111bac 21-Oct-2014 Christoph Lameter <cl@linux.com>

powerpc: Replace __get_cpu_var uses

This still has not been merged and now powerpc is the only arch that does
not have this change. Sorry about missing linuxppc-dev before.

V2->V2
- Fix up to work against 3.18-rc1

__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x). This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area. __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :

__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset. Thereby address calculations are avoided and less registers
are used when code is generated.

At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.

The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non -x86
arches as well in order to optimize per cpu access by f.e. using a global
register that may be set to the per cpu base.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);

Converts to

int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);

Converts to

int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processors instance of a per cpu
variable.

DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)

Converts to

int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);

Converts to

memcpy(&x, this_cpu_ptr(&y), sizeof(x));

5. Assignment to a per cpu variable

DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;

Converts to

__this_cpu_write(y, x);

6. Increment/Decrement etc of a per cpu variable

DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++

Converts to

__this_cpu_inc(y)

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
[mpe: Fix build errors caused by set/or_softirq_pending(), and rework
assignment in __set_breakpoint() to use memcpy().]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# bc3c4327 27-Aug-2014 Li Zhong <zhong@linux.vnet.ibm.com>

powerpc: Only set numa node information for present cpus at boottime

As Nish suggested, it makes more sense to init the numa node informatiion
for present cpus at boottime, which could also avoid WARN_ON(1) in
numa_setup_cpu().

With this change, we also need to change the smp_prepare_cpus() to set up
numa information only on present cpus.

For those possible, but not present cpus, their numa information
will be set up after they are started, as the original code did before commit
2fabf084b6ad.

Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Cyril Bur <cyril.bur@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 1217d34b 19-Aug-2014 Anton Blanchard <anton@samba.org>

powerpc: Ensure global functions include their prototype

Fix a number of places where global functions were not including
their prototype. This ensures the prototype and the function match.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 23f66e2d 27-Aug-2014 Tejun Heo <tj@kernel.org>

Revert "powerpc: Replace __get_cpu_var uses"

This reverts commit 5828f666c069af74e00db21559f1535103c9f79a due to
build failure after merging with pending powerpc changes.

Link: http://lkml.kernel.org/g/20140827142243.6277eaff@canb.auug.org.au

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 5828f666 16-Aug-2014 Christoph Lameter <cl@linux.com>

powerpc: Replace __get_cpu_var uses

__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x). This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area. __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :

#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset. Thereby address calculations are avoided and less registers
are used when code is generated.

At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.

The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non -x86
arches as well in order to optimize per cpu access by f.e. using a global
register that may be set to the per cpu base.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);

Converts to

int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);

Converts to

int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processors instance of a per cpu
variable.

DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)

Converts to

int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);

Converts to

memcpy(&x, this_cpu_ptr(&y), sizeof(x));

5. Assignment to a per cpu variable

DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;

Converts to

__this_cpu_write(y, x);

6. Increment/Decrement etc of a per cpu variable

DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++

Converts to

__this_cpu_inc(y)

tj: Folded a fix patch.
http://lkml.kernel.org/g/alpine.DEB.2.11.1408172143020.9652@gentwo.org

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 2fabf084 17-Jul-2014 Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

powerpc: reorder per-cpu NUMA information's initialization

There is an issue currently where NUMA information is used on powerpc
(and possibly ia64) before it has been read from the device-tree, which
leads to large slab consumption with CONFIG_SLUB and memoryless nodes.

NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
after start_secondary(), similar to ia64, which is invoked via
smp_init().

Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as
early_initcall()") made init_workqueues() be invoked via
do_pre_smp_initcalls(), which is obviously before the secondary
processors are online.

Additionally, the following commits changed init_workqueues() to use
cpu_to_node to determine the node to use for kthread_create_on_node:

bce903809ab3f ("workqueue: add wq_numa_tbl_len and
wq_numa_possible_cpumask[]")
f3f90ad469342 ("workqueue: determine NUMA node of workers accourding to
the allowed cpumask")

Therefore, when init_workqueues() runs, it sees all CPUs as being on
Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
a high number of slab deactivations
(http://www.spinics.net/lists/linux-mm/msg67489.html).

Fix this by initializing the powerpc-specific CPU<->node/local memory
node mapping as early as possible, which on powerpc is
do_init_bootmem(). Currently that function initializes the mapping for
the boot CPU, but we extend it to setup the mapping for all possible
CPUs. Then, in smp_prepare_cpus(), we can correspondingly set the
per-cpu values for all possible CPUs. That ensures that before the
early_initcalls run (and really as early as possible), the per-cpu NUMA
mapping is accurate.

While testing memoryless nodes on PowerKVM guests with a fix to the
workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a
guest topology of:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 16336 MB
node 1 free: 15329 MB
node distances:
node 0 1
0: 10 40
1: 40 10

the slab consumption decreases from

Slab: 932416 kB
SUnreclaim: 902336 kB

to

Slab: 395264 kB
SUnreclaim: 359424 kB

And we a corresponding increase in the slab efficiency from

slab mem objs slabs
used active active
------------------------------------------------------------
kmalloc-16384 337 MB 11.28% 100.00%
task_struct 288 MB 9.93% 100.00%

to

slab mem objs slabs
used active active
------------------------------------------------------------
kmalloc-16384 37 MB 100.00% 100.00%
task_struct 31 MB 100.00% 100.00%

Powerpc didn't support memoryless nodes until recently (64bb80d87f01
"powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c272261194d
"powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also
helped improve memory consumption with these kind of environments.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# b6220ad6 24-Jun-2014 Guenter Roeck <linux@roeck-us.net>

sched: Fix compiler warnings

Commit 143e1e28cb (sched: Rework sched_domain topology definition)
introduced a number of functions with a return value of 'const int'.
gcc doesn't know what to do with that and, if the kernel is compiled
with W=1, complains with the following warnings whenever sched.h
is included.

include/linux/sched.h:875:25: warning: type qualifiers ignored on function return type
include/linux/sched.h:882:25: warning: type qualifiers ignored on function return type
include/linux/sched.h:889:25: warning: type qualifiers ignored on function return type
include/linux/sched.h:1002:21: warning: type qualifiers ignored on function return type

Commits fb2aa855 (sched, ARM: Create a dedicated scheduler topology table)
and 607b45e9a (sched, powerpc: Create a dedicated topology table) introduce
the same warning in the arm and powerpc code.

Drop 'const' from the function declarations to fix the problem.

The fix for all three patches has to be applied together to avoid
compilation failures for the affected architectures.

Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403658329-13196-1-git-send-email-linux@roeck-us.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 5d4dfddd 27-May-2014 Nicolas Pitre <nico@fluxnic.net>

sched: Rename capacity related flags

It is better not to think about compute capacity as being equivalent
to "CPU power". The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.

Let's rename the following feature flags since they do relate to capacity:

SD_SHARE_CPUPOWER -> SD_SHARE_CPUCAPACITY
ARCH_POWER -> ARCH_CAPACITY
NONTASK_POWER -> NONTASK_CAPACITY

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linaro-kernel@lists.linaro.org
Cc: Andy Fleming <afleming@freescale.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: devicetree@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/n/tip-e93lpnxb87owfievqatey6b5@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 6f5e40a3 23-May-2014 Michael Ellerman <mpe@ellerman.id.au>

powerpc: Check cpu_thread_in_subcore() in __cpu_up()

To support split core we need to change the check in __cpu_up() that
determines if a cpu is allowed to come online.

Currently we refuse to online cpus which are not the primary thread
within their core.

On POWER8 with split core support this check needs to instead refuse to
online cpus which are not the primary thread within their *sub* core.

On POWER7 and other systems that do not support split core,
threads_per_subcore == threads_per_core and so the check is equivalent.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 441c19c8 23-May-2014 Michael Ellerman <mpe@ellerman.id.au>

powerpc/kvm/book3s_hv: Rework the secondary inhibit code

As part of the support for split core on POWER8, we want to be able to
block splitting of the core while KVM VMs are active.

The logic to do that would be exactly the same as the code we currently
have for inhibiting onlining of secondaries.

Instead of adding an identical mechanism to block split core, rework the
secondary inhibit code to be a "HV KVM is active" check. We can then use
that in both the cpu hotplug code and the upcoming split core code.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Alexander Graf <agraf@suse.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 64bb80d8 16-May-2014 Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES

Based off fd1197f1 for ia64, enable CONFIG_HAVE_MEMORYLESS_NODES if
NUMA. Initialize the local memory node in start_secondary.

With this commit and the preceding to enable
CONFIG_USER_PERCPU_NUMA_NODE_ID, which is a prerequisite, in a PowerKVM
guest with the following topology:

numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
95 96 97 98 99
node 0 size: 1998 MB
node 0 free: 521 MB
node 1 cpus: 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114
115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132
133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168
169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186
187 188 189 190 191 192 193 194 195 196 197 198 199
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus:
node 2 size: 2039 MB
node 2 free: 1739 MB
node distances:
node 0 1 2
0: 10 40 40
1: 40 10 40
2: 40 40 10

the unreclaimable slab is reduced by close to 130M:

Before:
Slab: 418176 kB
SReclaimable: 26624 kB
SUnreclaim: 391552 kB

After:
Slab: 298944 kB
SReclaimable: 31744 kB
SUnreclaim: 267200 kB

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 8c272261 19-May-2014 Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID

Based off 3bccd996 for ia64, convert powerpc to use the generic per-CPU
topology tracking, specifically:

initialize per cpu numa_node entry in start_secondary
remove the powerpc cpu_to_node()
define CONFIG_USE_PERCPU_NUMA_NODE_ID if NUMA

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 607b45e9 11-Apr-2014 Vincent Guittot <vincent.guittot@linaro.org>

sched, powerpc: Create a dedicated topology table

Create a dedicated topology table for handling asymetric feature of powerpc.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Fleming <afleming@freescale.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: schwidefsky@de.ibm.com
Cc: cmetcalf@tilera.com
Cc: dietmar.eggemann@arm.com
Cc: devicetree@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1397209481-28542-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 1b67bee1 25-Feb-2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

powerpc: Implement tick broadcast IPI as a fixed IPI message

For scalability and performance reasons, we want the tick broadcast IPIs
to be handled as efficiently as possible. Fixed IPI messages
are one of the most efficient mechanisms available - they are faster than
the smp_call_function mechanism because the IPI handlers are fixed and hence
they don't involve costly operations such as adding IPI handlers to the target
CPU's function queue, acquiring locks for synchronization etc.

Luckily we have an unused IPI message slot, so use that to implement
tick broadcast IPIs efficiently.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
[Functions renamed to tick_broadcast* and Changelog modified by
Preeti U. Murthy<preeti@linux.vnet.ibm.com>]
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 402d9a1e 25-Feb-2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message

The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map
to a common implementation - generic_smp_call_function_single_interrupt(). So,
we can consolidate them and save one of the IPI message slots, (which are
precious on powerpc, since only 4 of those slots are available).

So, implement the functionality of PPC_MSG_CALL_FUNC_SINGLE using
PPC_MSG_CALL_FUNC itself and release its IPI message slot, so that it can be
used for something else in the future, if desired.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# f8a1883a 11-Dec-2013 Anton Blanchard <anton@samba.org>

powerpc: Fix topology core_id endian issue on LE builds

cpu_to_core_id() is missing a byteswap:

cat /sys/devices/system/cpu/cpu63/topology/core_id
201326592

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# dfee0efe 22-Jul-2013 Chen Gang <gang.chen@asianux.com>

powerpc: kernel: remove useless code which related with 'max_cpus'

Since not need 'max_cpus' after the related commit, the related code
are useless too, need be removed.

The related commit:

c1aa687 powerpc: Clean up obsolete code relating to decrementer and timebase

The related warning:

arch/powerpc/kernel/smp.c:323:43: warning: parameter ‘max_cpus’ set but not used [-Wunused-but-set-parameter]

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 3eb906c6 19-Nov-2013 Michael Ellerman <mpe@ellerman.id.au>

powerpc: Make cpu_to_chip_id() available when SMP=n

Up until now we have only used cpu_to_chip_id() in the topology code,
which is only used on SMP builds. However my recent commit a4da0d5
"Implement arch_get_random_long/int() for powernv" added a usage when
SMP=n, breaking the build.

Move cpu_to_chip_id() into prom.c so it is available for SMP=n builds.

We would move the extern to prom.h, but that breaks the include in
topology.h. Instead we leave it in smp.h, but move it out of the
CONFIG_SMP #ifdef. We also need to include asm/smp.h in rng.c, because
the linux version skips asm/smp.h on UP. What a mess.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 6dedcca6 25-Sep-2013 Toshi Kani <toshi.kani@hp.com>

hotplug, powerpc, x86: Remove cpu_hotplug_driver_lock()

cpu_hotplug_driver_lock() serializes CPU online/offline operations
when ARCH_CPU_PROBE_RELEASE is set. This lock interface is no longer
necessary with the following reason:

- lock_device_hotplug() now protects CPU online/offline operations,
including the probe & release interfaces enabled by
ARCH_CPU_PROBE_RELEASE. The use of cpu_hotplug_driver_lock() is
redundant.
- cpu_hotplug_driver_lock() is only valid when ARCH_CPU_PROBE_RELEASE
is defined, which is misleading and is only enabled on powerpc.

This patch removes the cpu_hotplug_driver_lock() interface. As
a result, ARCH_CPU_PROBE_RELEASE only enables / disables the cpu
probe & release interface as intended. There is no functional change
in this patch.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 256588fd 09-Sep-2013 Guenter Roeck <linux@roeck-us.net>

powerpc: Export cpu_to_chip_id() to fix build error

powerpc allmodconfig build fails with:

ERROR: ".cpu_to_chip_id" [drivers/block/mtip32xx/mtip32xx.ko] undefined!

The problem was introduced with commit 15863ff3b (powerpc: Make chip-id
information available to userspace).

Export the missing symbol.

Cc: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Cc: Shivaprasad G Bhat <sbhat@linux.vnet.ibm.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 0654de1c 06-Aug-2013 Anton Blanchard <anton@samba.org>

powerpc: Little endian SMP IPI demux

Add little endian support for demuxing SMP IPIs

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 15863ff3 12-Aug-2013 Vasant Hegde <hegdevasant@linux.vnet.ibm.com>

powerpc: Make chip-id information available to userspace

So far "/sys/devices/system/cpu/cpuX/topology/physical_package_id"
was always default (-1) on ppc64 architecture.

Now, some systems have an ibm,chip-id property in the cpu nodes in
the device tree. On these systems, we now use this information to
display physical_package_id.

Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 256f2d4b 12-Aug-2013 Paul Mackerras <paulus@samba.org>

powerpc: Use ibm, chip-id property to compute cpu_core_mask if available

Some systems have an ibm,chip-id property in the cpu nodes in the
device tree. On these systems, we now use that to compute the
cpu_core_mask (i.e. the set of core siblings) rather than looking
at cache properties.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# a8a5356c 12-Aug-2013 Paul Mackerras <paulus@samba.org>

powerpc: Pull out cpu_core_mask updates into a separate function

This factors out the details of updating cpu_core_mask into a separate
function, to make it easier to change how the mask is calculated later.
This makes no functional change.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 3cd85250 05-Aug-2013 Andy Fleming <afleming@freescale.com>

powerpc: Add smp_generic_cpu_bootable

Cell and PSeries both implemented their own versions of a
cpu_bootable smp_op which do the same thing (well, the PSeries
one has support for more than 2 threads). Copy the PSeries one
to generic code, and rename it smp_generic_cpu_bootable.

Signed-off-by: Andy Fleming <afleming@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# b0d436c7 06-Aug-2013 Anton Blanchard <anton@samba.org>

powerpc: Fix a number of sparse warnings

Address some of the trivial sparse warnings in arch/powerpc.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# cce606fe 16-May-2013 Li Zhong <zhong@linux.vnet.ibm.com>

powerpc: Set cpu sibling mask before online cpu

It seems following race is possible:

cpu0 cpux
smp_init->cpu_up->_cpu_up
__cpu_up
kick_cpu(1)
-------------------------------------------------------------------------
waiting online ...
... notify CPU_STARTING
set cpux active
set cpux online
-------------------------------------------------------------------------
finish waiting online
...
sched_init_smp
init_sched_domains(cpu_active_mask)
build_sched_domains
set cpux sibling info
-------------------------------------------------------------------------

Execution of cpu0 and cpux could be concurrent between two separator
lines.

So if the cpux sibling information was set too late (normally
impossible, but could be triggered by adding some delay in
start_secondary, after setting cpu online), build_sched_domains()
running on cpu0 might see cpux active, with an empty sibling mask, then
cause some bad address accessing like following:

[ 0.099855] Unable to handle kernel paging request for data at address 0xc00000038518078f
[ 0.099868] Faulting instruction address: 0xc0000000000b7a64
[ 0.099883] Oops: Kernel access of bad area, sig: 11 [#1]
[ 0.099895] PREEMPT SMP NR_CPUS=16 DEBUG_PAGEALLOC NUMA pSeries
[ 0.099922] Modules linked in:
[ 0.099940] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-rc1-00120-gb973425-dirty #16
[ 0.099956] task: c0000001fed80000 ti: c0000001fed7c000 task.ti: c0000001fed7c000
[ 0.099971] NIP: c0000000000b7a64 LR: c0000000000b7a40 CTR: c0000000000b4934
[ 0.099985] REGS: c0000001fed7f760 TRAP: 0300 Not tainted (3.10.0-rc1-00120-gb973425-dirty)
[ 0.099997] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 24272828 XER: 20000003
[ 0.100045] SOFTE: 1
[ 0.100053] CFAR: c000000000445ee8
[ 0.100064] DAR: c00000038518078f, DSISR: 40000000
[ 0.100073]
GPR00: 0000000000000080 c0000001fed7f9e0 c000000000c84d48 0000000000000010
GPR04: 0000000000000010 0000000000000000 c0000001fc55e090 0000000000000000
GPR08: ffffffffffffffff c000000000b80b30 c000000000c962d8 00000003845ffc5f
GPR12: 0000000000000000 c00000000f33d000 c00000000000b9e4 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000001 0000000000000000
GPR20: c000000000ccf750 0000000000000000 c000000000c94d48 c0000001fc504000
GPR24: c0000001fc504000 c0000001fecef848 c000000000c94d48 c000000000ccf000
GPR28: c0000001fc522090 0000000000000010 c0000001fecef848 c0000001fed7fae0
[ 0.100293] NIP [c0000000000b7a64] .get_group+0x84/0xc4
[ 0.100307] LR [c0000000000b7a40] .get_group+0x60/0xc4
[ 0.100318] Call Trace:
[ 0.100332] [c0000001fed7f9e0] [c0000000000dbce4] .lock_is_held+0xa8/0xd0 (unreliable)
[ 0.100354] [c0000001fed7fa70] [c0000000000bf62c] .build_sched_domains+0x728/0xd14
[ 0.100375] [c0000001fed7fbe0] [c000000000af67bc] .sched_init_smp+0x4fc/0x654
[ 0.100394] [c0000001fed7fce0] [c000000000adce24] .kernel_init_freeable+0x17c/0x30c
[ 0.100413] [c0000001fed7fdb0] [c00000000000ba08] .kernel_init+0x24/0x12c
[ 0.100431] [c0000001fed7fe30] [c000000000009f74] .ret_from_kernel_thread+0x5c/0x68
[ 0.100445] Instruction dump:
[ 0.100456] 38800010 38a00000 4838e3f5 60000000 7c6307b4 2fbf0000 419e0040 3d220001
[ 0.100496] 78601f24 39491590 e93e0008 7d6a002a <7d69582a> f97f0000 7d4a002a e93e0010
[ 0.100559] ---[ end trace 31fd0ba7d8756001 ]---

This patch tries to move the sibling maps updating before
notify_cpu_starting() and cpu online, and a write barrier there to make
sure sibling maps are updated before active and online mask.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 061d19f2 24-Jun-2013 Paul Gortmaker <paul.gortmaker@windriver.com>

powerpc: Delete __cpuinit usage from all users

The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications. For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out. Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.

This removes all the powerpc uses of the __cpuinit macros. There
are no __CPUINIT users in assembly files in powerpc.

[1] https://lkml.org/lkml/2013/5/20/589

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Josh Boyer <jwboyer@gmail.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 799fef06 21-Mar-2013 Thomas Gleixner <tglx@linutronix.de>

powerpc: Use generic idle loop

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Magnus Damm <magnus.damm@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Link: http://lkml.kernel.org/r/20130321215235.026838003@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 174ea471 04-Feb-2013 Daniel Borkmann <daniel@iogearbox.net>

powerpc: fix ics_rtas_init and start_secondary section mismatch

It seems, we're fine with just annotating the two functions.
Thus, this fixes the following build warnings on ppc64:

WARNING: arch/powerpc/sysdev/xics/built-in.o(.text+0x1664):
The function .ics_rtas_init() references
the function __init .xics_register_ics().
This is often because .ics_rtas_init lacks a __init
annotation or the annotation of .xics_register_ics is wrong.

WARNING: arch/powerpc/sysdev/built-in.o(.text+0x6044):
The function .ics_rtas_init() references
the function __init .xics_register_ics().
This is often because .ics_rtas_init lacks a __init
annotation or the annotation of .xics_register_ics is wrong.

WARNING: arch/powerpc/kernel/built-in.o(.text+0x2db30):
The function .start_secondary() references
the function __cpuinit .vdso_getcpu_init().
This is often because .start_secondary lacks a __cpuinit
annotation or the annotation of .vdso_getcpu_init is wrong.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# cad5cef6 21-Dec-2012 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

POWERPC: drivers: remove __dev* attributes.

CONFIG_HOTPLUG is going away as an option. As a result, the __dev*
markings need to be removed.

This change removes the use of __devinit, __devexit_p, __devinitdata,
__devinitconst, and __devexit from these drivers.

Based on patches originally written by Bill Pemberton, but redone by me
in order to handle some of the coding style issues better, by hand.

Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 512691d4 14-Oct-2012 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S HV: Allow KVM guests to stop secondary threads coming online

When a Book3S HV KVM guest is running, we need the host to be in
single-thread mode, that is, all of the cores (or at least all of
the cores where the KVM guest could run) to be running only one
active hardware thread. This is because of the hardware restriction
in POWER processors that all of the hardware threads in the core
must be in the same logical partition. Complying with this restriction
is much easier if, from the host kernel's point of view, only one
hardware thread is active.

This adds two hooks in the SMP hotplug code to allow the KVM code to
make sure that secondary threads (i.e. hardware threads other than
thread 0) cannot come online while any KVM guest exists. The KVM
code still has to check that any core where it runs a guest has the
secondary threads offline, but having done that check it can now be
sure that they will not come online while the guest is running.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# e6651de9 20-Jul-2012 Zhao Chenhui <chenhui.zhao@freescale.com>

powerpc/smp: Do not disable IPI interrupts during suspend

During suspend, all interrupts including IPI will be disabled. In this case,
the suspend process will hang in SMP. To prevent this, pass the flag
IRQF_NO_SUSPEND when requesting IPI irq.

Signed-off-by: Zhao Chenhui <chenhui.zhao@freescale.com>
Signed-off-by: Li Yang <leoli@freescale.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>


# ae5cab47 20-Jul-2012 Zhao Chenhui <chenhui.zhao@freescale.com>

powerpc/smp: add generic_set_cpu_up() to set cpu_state as CPU_UP_PREPARE

In the case of cpu hotplug, the cpu_state should be set to CPU_UP_PREPARE
when kicking cpu. Otherwise, the cpu_state is always CPU_DEAD after
calling generic_set_cpu_dead(), which makes the delay in generic_cpu_die()
not happen.

Signed-off-by: Zhao Chenhui <chenhui.zhao@freescale.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>


# 9fb1b36c 04-Sep-2012 Paul Mackerras <paulus@samba.org>

powerpc: Make sure IPI handlers see data written by IPI senders

We have been observing hangs, both of KVM guest vcpu tasks and more
generally, where a process that is woken doesn't properly wake up and
continue to run, but instead sticks in TASK_WAKING state. This
happens because the update of rq->wake_list in ttwu_queue_remote()
is not ordered with the update of ipi_message in
smp_muxed_ipi_message_pass(), and the reading of rq->wake_list in
scheduler_ipi() is not ordered with the reading of ipi_message in
smp_ipi_demux(). Thus it is possible for the IPI receiver not to see
the updated rq->wake_list and therefore conclude that there is nothing
for it to do.

In order to make sure that anything done before smp_send_reschedule()
is ordered before anything done in the resulting call to scheduler_ipi(),
this adds barriers in smp_muxed_message_pass() and smp_ipi_demux().
The barrier in smp_muxed_message_pass() is a full barrier to ensure that
there is a full ordering between the smp_send_reschedule() caller and
scheduler_ipi(). In smp_ipi_demux(), we use xchg() rather than
xchg_local() because xchg() includes release and acquire barriers.
Using xchg() rather than xchg_local() makes sense given that
ipi_message is not just accessed locally.

This moves the barrier between setting the message and calling the
cause_ipi() function into the individual cause_ipi implementations.
Most of them -- those that used outb, out_8 or similar -- already had
a full barrier because out_8 etc. include a sync before the MMIO
store. This adds an explicit barrier in the two remaining cases.

These changes made no measurable difference to the speed of IPIs as
measured using a simple ping-pong latency test across two CPUs on
different cores of a POWER7 machine.

The analysis of the reason why processes were not waking up properly
is due to Milton Miller.

Cc: stable@vger.kernel.org # v3.0+
Reported-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 18ad51dd 04-Jul-2012 Anton Blanchard <anton@samba.org>

powerpc: Add VDSO version of getcpu

We have a request for a fast method of getting CPU and NUMA node IDs
from userspace. This patch implements a getcpu VDSO function,
similar to x86.

Ben suggested we use SPRG3 which is userspace readable. SPRG3 can be
modified by a KVM guest, so we save the SPRG3 value in the paca and
restore it when transitioning from the guest to the host.

I have a glibc patch that implements sched_getcpu on top of this.
Testing on a POWER7:

baseline: 538 cycles
vdso: 30 cycles

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# e250d4bc 28-May-2012 Yong Zhang <yong.zhang@windriver.com>

powerpc/smp: remove call to ipi_call_lock()/ipi_call_unlock()

1) call_function.lock used in smp_call_function_many() is just to protect
call_function.queue and &data->refs, cpu_online_mask is outside of the
lock. And it's not necessary to protect cpu_online_mask,
because data->cpumask is pre-calculate and even if a cpu is brougt up
when calling arch_send_call_function_ipi_mask(), it's harmless because
validation test in generic_smp_call_function_interrupt() will take care
of it.

2) For cpu down issue, stop_machine() will guarantee that no concurrent
smp_call_fuction() is processing.

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 72ec61a9 29-May-2012 Yong Zhang <yong.zhang0@gmail.com>

POWERPC: Smp: remove call to ipi_call_lock()/ipi_call_unlock()

ipi_call_lock/unlock() lock resp. unlock call_function.lock. This lock
protects only the call_function data structure itself, but it's
completely unrelated to cpu_online_mask. The mask to which the IPIs
are sent is calculated before call_function.lock is taken in
smp_call_function_many(), so the locking around set_cpu_online() is
pointless and can be removed.

[ tglx: Massaged changelog ]

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: ralf@linux-mips.org
Cc: sshtylyov@mvista.com
Cc: david.daney@cavium.com
Cc: nikunj@linux.vnet.ibm.com
Cc: paulmck@linux.vnet.ibm.com
Cc: axboe@kernel.dk
Cc: peterz@infradead.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1338275765-3217-10-git-send-email-yong.zhang0@gmail.com
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 17e32eac 20-Apr-2012 Thomas Gleixner <tglx@linutronix.de>

powerpc: Use generic idle thread allocation

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Link: http://lkml.kernel.org/r/20120420124557.311212868@linutronix.de


# 8239c25f 20-Apr-2012 Thomas Gleixner <tglx@linutronix.de>

smp: Add task_struct argument to __cpu_up()

Preparatory patch to make the idle thread allocation for secondary
cpus generic.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20120420124556.964170564@linutronix.de


# ae3a197e 28-Mar-2012 David Howells <dhowells@redhat.com>

Disintegrate asm/system.h for PowerPC

Disintegrate asm/system.h for PowerPC.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
cc: linuxppc-dev@lists.ozlabs.org


# 8a25a2fd 21-Dec-2011 Kay Sievers <kay.sievers@vrfy.org>

cpu: convert 'cpu' and 'machinecheck' sysdev_class to a regular subsystem

This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
and converts the devices to regular devices. The sysdev drivers are
implemented as subsystem interfaces now.

After all sysdev classes are ported to regular driver core entities, the
sysdev implementation will be entirely removed from the kernel.

Userspace relies on events and generic sysfs subsystem infrastructure
from sysdev devices, which are made available with this conversion.

Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@amd64.org>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Len Brown <lenb@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# 3b5e16d7 04-Oct-2011 Thomas Gleixner <tglx@linutronix.de>

powerpc: Mark IPI interrupts IRQF_NO_THREAD

IPI handlers cannot be threaded. Remove the obsolete IRQF_DISABLED
flag (see commit e58aa3d2) while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# a3a9f3b4 21-Oct-2011 Yong Zhang <yong.zhang0@gmail.com>

powerpc/irq: Remove IRQF_DISABLED

Since commit [e58aa3d2: genirq: Run irq handlers with interrupts disabled],
We run all interrupt handlers with interrupts disabled
and we even check and yell when an interrupt handler
returns with interrupts enabled (see commit [b738a50a:
genirq: Warn when handler enables interrupts]).

So now this flag is a NOOP and can be removed.

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 4b16f8e2 22-Jul-2011 Paul Gortmaker <paul.gortmaker@windriver.com>

powerpc: various straight conversions from module.h --> export.h

All these files were including module.h just for the basic
EXPORT_SYMBOL infrastructure. We can shift them off to the
export.h header which is a way smaller footprint and thus
realize some compile time gains.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# fb82b839 19-Sep-2011 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/smp: More generic support for "soft hotplug"

This adds more generic support for doing CPU hotplug with a simple
idle loop and no actual reset of the processors. The generic
smp_generic_kick_cpu() does the hotplug bringup trick if the PACA
shows that the CPU has already been started at boot and we provide
an accessor for the CPU state.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 60063497 26-Jul-2011 Arun Sharma <asharma@fb.com>

atomic: use <linux/atomic.h>

This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# de56a948 28-Jun-2011 Paul Mackerras <paulus@samba.org>

KVM: PPC: Add support for Book3S processors in hypervisor mode

This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.

This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.

Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.

This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.

With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.

We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.

In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.

The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.

This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.

This also adds a few new exports needed by the book3s_hv code.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 3160b097 28-Jun-2011 Becky Bruce <beckyb@kernel.crashing.org>

powerpc: Create next_tlbcam_idx percpu variable for FSL_BOOKE

This is used to round-robin TLBCAM entries.

Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>


# 3d97a619 22-Jun-2011 Scott Wood <scottwood@freescale.com>

powerpc/book3e-64: Reraise doorbell when masked by soft-irq-disable

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 9ca980dc 25-May-2011 Paul Mackerras <paulus@samba.org>

powerpc: Avoid extra indirect function call in sending IPIs

On many platforms (including pSeries), smp_ops->message_pass is always
smp_muxed_ipi_message_pass. This changes arch/powerpc/kernel/smp.c so
that if smp_ops->message_pass is NULL, it calls smp_muxed_ipi_message_pass
directly.

This means that a platform doesn't need to set both .message_pass and
.cause_ipi, only one of them. It is a slight performance improvement
in that it gets rid of an indirect function call at the expense of a
predictable conditional branch.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 7ef71d75 24-May-2011 Milton Miller <miltonm@bga.com>

powerpc/cell: Use common smp ipi actions

The cell iic interrupt controller has enough software caused interrupts
to use a unique interrupt for each of the 4 messages powerpc uses.
This means each interrupt gets its own irq action/data combination.

Use the seperate, optimized, arch common ipi action functions
registered via the helper smp_request_message_ipi instead passing the
message as action data to a single action that then demultipexes to
the required acton via a switch statement.

smp_request_message_ipi will register the action as IRQF_PER_CPU
and IRQF_DISABLED, and WARN if the allocation fails for some reason,
so no need to print on that failure. It will return positive if
the message will not be used by the kernel, in which case we can
free the virq.

In addition to elimiating inefficient code, this also corrects the
error that a kernel built with kexec but without a debugger would
not register the ipi for kdump to notify the other cpus of a crash.

This also restores the debugger action to be static to kernel/smp.c.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 71454272 10-May-2011 Milton Miller <miltonm@bga.com>

powerpc: Use bytes instead of bitops in smp ipi multiplexing

Since there are only 4 messages, we can replace the atomic bit set
(which uses atomic load reserve and store conditional sequence) with
a byte stores to seperate bytes. We still have to perform a load
reserve and store conditional sequence to avoid loosing messages on
reception but we can do that with a single call to xchg.

The do {} while and __BIG_ENDIAN specific mask testing was chosen by
looking at the generated asm code. On gcc-4.4, the bit masking becomes
a simple bit mask and test of the register returned from xchg without
storing and loading the value to the stack like attempts with a union
of bytes and an int (or worse, loading single bit constants from the
constant pool into non-voliatle registers that had to be preseved on
the stack). The do {} while avoids an unconditional branch to the
end of the loop to test the entry / repeat condition of a while loop
and instead optimises for the expected single iteration of the loop.

We have a full mb() at the beginning to cover ordering between send,
ipi, and receive so we can use xchg_local and forgo the further
acquire and release barriers of xchg.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 1ece355b 10-May-2011 Milton Miller <miltonm@bga.com>

powerpc: Add kconfig for muxed smp ipi support

Compile the new smp ipi mux and demux code only if a platform
will make use of it. The new config is selected as required.

The new cause_ipi smp op is only available conditionally to point out
configs where the select is required; this makes setting the op an
immediate fail instead of a deferred unresolved symbol at link.

This also creates a new config for power surge powermac upgrade support
that can be disabled in expert mode but is default on.

I also removed the depends / default y on CONFIG_XICS since it is selected
by PSERIES.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 23d72bfd 10-May-2011 Milton Miller <miltonm@bga.com>

powerpc: Consolidate ipi message mux and demux

Consolidate the mux and demux of ipi messages into smp.c and call
a new smp_ops callback to actually trigger the ipi.

The powerpc architecture code is optimised for having 4 distinct
ipi triggers, which are mapped to 4 distinct messages (ipi many, ipi
single, scheduler ipi, and enter debugger). However, several interrupt
controllers only provide a single software triggered interrupt that
can be delivered to each cpu. To resolve this limitation, each smp_ops
implementation created a per-cpu variable that is manipulated with atomic
bitops. Since these lines will be contended they are optimialy marked as
shared_aligned and take a full cache line for each cpu. Distro kernels
may have 2 or 3 of these in their config, each taking per-cpu space
even though at most one will be in use.

This consolidation removes smp_message_recv and replaces the single call
actions cases with direct calls from the common message recognition loop.
The complicated debugger ipi case with its muxed crash handling code is
moved to debug_ipi_action which is now called from the demux code (instead
of the multi-message action calling smp_message_recv).

I put a call to reschedule_action to increase the likelyhood of correctly
merging the anticipated scheduler_ipi() hook coming from the scheduler
tree; that single required call can be inlined later.

The actual message decode is a copy of the old pseries xics code with its
memory barriers and cache line spacing, augmented with a per-cpu unsigned
long based on the book-e doorbell code. The optional data is set via a
callback from the implementation and is passed to the new cause-ipi hook
along with the logical cpu number. While currently only the doorbell
implemntation uses this data it should be almost zero cost to retrieve and
pass it -- it adds a single register load for the argument from the same
cache line to which we just completed a store and the register is dead
on return from the call. I extended the data element from unsigned int
to unsigned long in case some other code wanted to associate a pointer.

The doorbell check_self is replaced by a call to smp_muxed_ipi_resend,
conditioned on the CPU_DBELL feature. The ifdef guard could be relaxed
to CONFIG_SMP but I left it with BOOKE for now.

Also, the doorbell interrupt vector for book-e was not calling irq_enter
and irq_exit, which throws off cpu accounting and causes code to not
realize it is running in interrupt context. Add the missing calls.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# e0476371 10-May-2011 Milton Miller <miltonm@bga.com>

powerpc: Remove call sites of MSG_ALL_BUT_SELF

The only user of MSG_ALL_BUT_SELF in the whole kernel tree is powerpc,
and it only uses it to start the debugger. Both debuggers always call
smp_send_debugger_break with MSG_ALL_BUT_SELF, and only mpic can do
anything more optimal than a loop over all online cpus, but all message
passing implementations have to code for this special delivery target.

Convert smp_send_debugger_break to take void and loop calling the smp_ops
message_pass function for each of the other cpus in the online cpumask.

Use raw_smp_processor_id() because we are either entering the debugger
or trying to start kdump and the additional warning it not useful were
it to trigger.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# c560bbce 17-May-2011 kerstin jonsson <kerstin.jonsson@ericsson.com>

powerpc/4xx: Fix regression in SMP on 476

commit c56e58537d504706954a06570b4034c04e5b7500 breaks SMP support in PPC_47x chip.
secondary_ti must be set to current thread info before callin kick_cpu or else
start_secondary_47x will jump into void when trying to return to c-code.
In the current setup secondary_ti is initialized before the CPU idle task is started
and only the boot core will start. I am not sure this is the correct solution, but it
makes SMP possible in my chip.
Note! The HOTPLUG support probably need some fixing to, There is no trampoline code
available in head_44x.S - start_secondary_resume?

Signed-off-by: Kerstin Jonsson <kerstin.jonsson@ericsson.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 104699c0 27-Apr-2011 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

powerpc: Convert old cpumask API into new one

Adapt new API.

Almost change is trivial. Most important change is the below line
because we plan to change task->cpus_allowed implementation.

- ctx->cpus_allowed = current->cpus_allowed;

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# de300974 11-Apr-2011 Michael Ellerman <michael@ozlabs.org>

powerpc/smp: smp_ops->kick_cpu() should be able to fail

When we start a cpu we use smp_ops->kick_cpu(), which currently
returns void, it should be able to fail. Convert it to return
int, and update all uses.

Convert all the current error cases to return -ENOENT, which is
what would eventually be returned by __cpu_up() currently when
it doesn't detect the cpu as coming up in time.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 184748cc 05-Apr-2011 Peter Zijlstra <a.p.zijlstra@chello.nl>

sched: Provide scheduler_ipi() callback in response to smp_send_reschedule()

For future rework of try_to_wake_up() we'd like to push part of that
function onto the CPU the task is actually going to run on.

In order to do so we need a generic callback from the existing scheduler IPI.

This patch introduces such a generic callback: scheduler_ipi() and
implements it as a NOP.

BenH notes: PowerPC might use this IPI on offline CPUs under rare conditions!

Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110405152728.744338123@chello.nl


# aeeafbfa 07-Mar-2011 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/smp: Increase vdso_data->processorCount, not just decrease it

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# c56e5853 07-Mar-2011 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/smp: Create idle threads on demand and properly reset them

Instead of creating idle threads at boot for all possible CPUs, we
create them on demand, like x86 or ARM, and we properly call init_idle
to re-initialize an idle thread when a CPU was unplugged and is now
re-plugged.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 105765f4 31-Mar-2011 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/smp: Don't expose per-cpu "cpu_state" array

Instead, keep it static, expose an accessor and use that from
the PowerMac code. Avoids easy namespace collisions and will
make it easier to consolidate with other implementations.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# d7294445 07-Mar-2011 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/smp: Add a smp_ops->bringup_up() done callback

This allows us to stop abusing smp_ops->setup_cpu() for cleanup
tasks that have to take place after the initial boot time CPU
bringup.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 1c91cc57 10-Feb-2011 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/pmac/smp: Rename fixup_irqs() to migrate_irqs() and use it on ppc32

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 7a53a4fe 10-Feb-2011 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/smp: Remove unused smp_ops->cpu_enable()

Remove the last remnants of cpu_enable(), everybody uses the normal
__cpu_up() path now

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# b527d071 10-Feb-2011 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/smp: Remove unused generic_cpu_enable()

Nobody uses it, besides we should always use the normal __cpu_up
path anyways

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 4fcb8833 10-Feb-2011 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/smp: Fix generic_mach_cpu_die()

This is used by some "soft" hotplug implementations. I needs to
call idle_task_exit() when the CPU is going away, and we remove
the now no-longer needed set_cpu_online() and local_irq_enable()
which are handled by the return to start_secondary

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# fa3f82c8 10-Feb-2011 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/smp: soft-replugged CPUs must go back to start_secondary

Various thing are torn down when a CPU is hot-unplugged. That CPU
is expected to go back to start_secondary when re-plugged to re
initialize everything, such as clock sources, maps, ...

Some implementations just return from cpu_die() callback
in the idle loop when the CPU is "re-plugged". This is not enough.

We fix it using a little asm trampoline which resets the stack
and calls back into start_secondary as if we were all fresh from
boot. The trampoline already existed on ppc64, but we add it for
ppc32

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 99d86705 06-Oct-2010 Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>

powerpc: Cleanup APIs for cpu/thread/core mappings

These APIs take logical cpu number as input
Change cpu_first_thread_in_core() to cpu_first_thread_sibling()
Change cpu_last_thread_in_core() to cpu_last_thread_sibling()

These APIs convert core number (index) to logical cpu/thread numbers
Add cpu_first_thread_of_core(int core)
Changed cpu_thread_to_core() to cpu_core_index_of_thread(int cpu)

The goal is to make 'threads_per_core' accessible to the
pseries_energy module. Instead of making an API to read
threads_per_core, this is a higher level wrapper function to
convert from logical cpu number to core number.

The current APIs cpu_first_thread_in_core() and
cpu_last_thread_in_core() returns logical CPU number while
cpu_thread_to_core() returns core number or index which is
not a logical CPU number. The new APIs are now clearly named to
distinguish 'core number' versus first and last 'logical cpu
number' in that core.

The new APIs cpu_{first,last}_thread_sibling() work on
logical cpu numbers. While cpu_first_thread_of_core() and
cpu_core_index_of_thread() work on core index.

Example usage: (4 threads per core system)

cpu_first_thread_sibling(5) = 4
cpu_last_thread_sibling(5) = 7
cpu_core_index_of_thread(5) = 1
cpu_first_thread_of_core(1) = 4

cpu_core_index_of_thread() is used in cpu_to_drc_index() in the
module and cpu_first_thread_of_core() is used in
drc_index_to_cpu() in the module.

Make API changes to few callers. Export symbols for use in modules.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# cf9efce0 26-Aug-2010 Paul Mackerras <paulus@samba.org>

powerpc: Account time using timebase rather than PURR

Currently, when CONFIG_VIRT_CPU_ACCOUNTING is enabled, we use the
PURR register for measuring the user and system time used by
processes, as well as other related times such as hardirq and
softirq times. This turns out to be quite confusing for users
because it means that a program will often be measured as taking
less time when run on a multi-threaded processor (SMT2 or SMT4 mode)
than it does when run on a single-threaded processor (ST mode), even
though the program takes longer to finish. The discrepancy is
accounted for as stolen time, which is also confusing, particularly
when there are no other partitions running.

This changes the accounting to use the timebase instead, meaning that
the reported user and system times are the actual number of real-time
seconds that the program was executing on the processor thread,
regardless of which SMT mode the processor is in. Thus a program will
generally show greater user and system times when run on a
multi-threaded processor than on a single-threaded processor.

On pSeries systems on POWER5 or later processors, we measure the
stolen time (time when this partition wasn't running) using the
hypervisor dispatch trace log. We check for new entries in the
log on every entry from user mode and on every transition from
kernel process context to soft or hard IRQ context (i.e. when
account_system_vtime() gets called). So that we can correctly
distinguish time stolen from user time and time stolen from system
time, without having to check the log on every exit to user mode,
we store separate timestamps for exit to user mode and entry from
user mode.

On systems that have a SPURR (POWER6 and POWER7), we read the SPURR
in account_system_vtime() (as before), and then apportion the SPURR
ticks since the last time we read it between scaled user time and
scaled system time according to the relative proportions of user
time and system time over the same interval. This avoids having to
read the SPURR on every kernel entry and exit. On systems that have
PURR but not SPURR (i.e., POWER5), we do the same using the PURR
rather than the SPURR.

This disables the DTL user interface in /sys/debug/kernel/powerpc/dtl
for now since it conflicts with the use of the dispatch trace log
by the time accounting code.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# e1f0ece1 10-Aug-2010 Michael Neuling <mikey@neuling.org>

powerpc: Move arch_sd_sibling_asym_packing() to smp.c

Simple cleanup by moving arch_sd_sibling_asym_packing from process.c to
smp.c to save an #ifdef CONFIG_SMP

No functionality change.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 6685a477 04-Aug-2010 Signed-off-by: Darren Hart <dvhltc@us.ibm.com>

powerpc: Silence __cpu_up() under normal operation

During CPU offline/online tests __cpu_up would flood the logs with
the following message:

Processor 0 found.

This provides no useful information to the user as there is no context
provided, and since the operation was a success (to this point) it is expected
that the CPU will come back online, providing all the feedback necessary.

Change the "Processor found" message to DBG() similar to other such messages in
the same function. Also, add an appropriate log level for the "Processor is
stuck" message.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# d77cb21b 15-Jul-2010 Tiejun Chen <tiejun.chen@windriver.com>

powerpc/smp: remove the incorrect decrementer initial codes for AP

We already defined start_cpu_decrementer() to invoke decrementer for AP as
the following path:

start_secondary() -> secondary_cpu_time_init() -> start_cpu_decrementer()

So remove these incorrect codes introduced from commit:
e7f75ad0 powerpc/47x: Base ppc476 support

And actually we really should not enable decrementer before calling set_dec().

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# d75d68cf 20-Jun-2010 Paul Mackerras <paulus@samba.org>

powerpc: Clean up obsolete code relating to decrementer and timebase

Since the decrementer and timekeeping code was moved over to using
the generic clockevents and timekeeping infrastructure, several
variables and functions have been obsolete and effectively unused.
This deletes them.

In particular, wakeup_decrementer() is no longer needed since the
generic code reprograms the decrementer as part of the process of
resuming the timekeeping code, which happens during sysdev resume.
Thus the wakeup_decrementer calls in the suspend_enter methods for
52xx platforms have been removed. The call in the powermac cpu
frequency change code has been replaced by set_dec(1), which will
cause a timer interrupt as soon as interrupts are enabled, and the
generic code will then reprogram the decrementer with the correct
value.

This also simplifies the generic_suspend_en/disable_irqs functions
and makes them static since they are not referenced outside time.c.
The preempt_enable/disable calls are removed because the generic
code has disabled all but the boot cpu at the point where these
functions are called, so we can't be moved to another cpu.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# c1aa687d 20-Jun-2010 Paul Mackerras <paulus@samba.org>

powerpc: Clean up obsolete code relating to decrementer and timebase

Since the decrementer and timekeeping code was moved over to using
the generic clockevents and timekeeping infrastructure, several
variables and functions have been obsolete and effectively unused.
This deletes them.

In particular, wakeup_decrementer() is no longer needed since the
generic code reprograms the decrementer as part of the process of
resuming the timekeeping code, which happens during sysdev resume.
Thus the wakeup_decrementer calls in the suspend_enter methods for
52xx platforms have been removed. The call in the powermac cpu
frequency change code has been replaced by set_dec(1), which will
cause a timer interrupt as soon as interrupts are enabled, and the
generic code will then reprogram the decrementer with the correct
value.

This also simplifies the generic_suspend_en/disable_irqs functions
and makes them static since they are not referenced outside time.c.
The preempt_enable/disable calls are removed because the generic
code has disabled all but the boot cpu at the point where these
functions are called, so we can't be moved to another cpu.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# abb17f9c 18-May-2010 Milton Miller <miltonm@bga.com>

powerpc: Use common cpu_die (fixes SMP+SUSPEND build)

Configuring a powerpc 32 bit kernel for both SMP and SUSPEND turns on
CPU_HOTPLUG to enable disable_nonboot_cpus to be called by the common
suspend code. Previously the definition of cpu_die for ppc32 was in
the powermac platform code, causing it to be undefined if that platform
as not selected.

arch/powerpc/kernel/built-in.o: In function 'cpu_idle':
arch/powerpc/kernel/idle.c:98: undefined reference to 'cpu_die'

Move the code from setup_64 to smp.c and rename the power mac
versions to their specific names.

Note that this does not setup the cpu_die pointers in either
smp_ops (request a given cpu die) or ppc_md (make this cpu die),
for other platforms but there are generic versions in smp.c.

Reported-by: Matt Sealey <matt@genesi-usa.com>
Reported-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Anton Vorontsov <avorontsov@mvista.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 828a6986 26-Apr-2010 Anton Blanchard <anton@samba.org>

powerpc/cpumask: Update some comments

Since the *_map cpumask variants are deprecated, change the comments to
instead refer to *_mask.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# cc1ba8ea 26-Apr-2010 Anton Blanchard <anton@samba.org>

powerpc/cpumask: Dynamically allocate cpu_sibling_map and cpu_core_map cpumasks

Dynamically allocate cpu_sibling_map and cpu_core_map cpumasks.

We don't need to set_cpu_online() the boot cpu in smp_prepare_boot_cpu,
init/main.c does it for us.

We also postpone setting of the boot cpu in cpu_sibling_map and cpu_core_map
until when the memory allocator is available (smp_prepare_cpus), similar
to x86.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# b6decb70 26-Apr-2010 Anton Blanchard <anton@samba.org>

powerpc/cpumask: Convert fixup_irqs to new cpumask API

Use new cpumask_* functions, and dynamically allocate cpumask in fixup_irqs.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# bfb9126d 26-Apr-2010 Anton Blanchard <anton@samba.org>

powerpc/cpumask: Convert smp_cpus_done to new cpumask API

Use the new cpumask_* functions and dynamically allocate the cpumask in
smp_cpus_done.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# e7f75ad0 05-Mar-2010 Dave Kleikamp <shaggy@linux.vnet.ibm.com>

powerpc/47x: Base ppc476 support

This patch adds the base support for the 476 processor. The code was
primarily written by Ben Herrenschmidt and Torez Smith, but I've been
maintaining it for a while.

The goal is to have a single binary that will run on 44x and 47x, but
we still have some details to work out. The biggest is that the L1 cache
line size differs on the two platforms, but it's currently a compile-time
option.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Torez Smith <lnxtorez@linux.vnet.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>


# 21dbeb91 25-Mar-2010 Julia Lawall <julia@diku.dk>

powerpc: Use set_cpus_allowed_ptr

Use set_cpus_allowed_ptr rather than set_cpus_allowed.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression E1,E2;
@@

- set_cpus_allowed(E1, cpumask_of_cpu(E2))
+ set_cpus_allowed_ptr(E1, cpumask_of(E2))

@@
expression E;
identifier I;
@@

- set_cpus_allowed(E, I)
+ set_cpus_allowed_ptr(E, &I)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# d0174c72 14-Jan-2010 Nathan Fontenot <nfont@austin.ibm.com>

powerpc: Move cpu hotplug driver lock from pseries to powerpc

Move the defintion and lock helper routines for the cpu hotplug driver
lock from pseries to powerpc code to avoid build breaks for platforms
other than pseries that use cpu hotplug.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Acked-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 8389b37d 25-Nov-2009 Valentine Barshak <vbarshak@ru.mvista.com>

powerpc: stop_this_cpu: remove the cpu from the online map.

Remove the CPU from the online map to prevent smp_call_function
from sending messages to a stopped CPU.

Signed-off-by: Valentine Barshak <vbarshak@ru.mvista.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 6b7487fc 29-Oct-2009 Tejun Heo <tj@kernel.org>

percpu: make percpu symbols in powerpc unique

This patch updates percpu related symbols in powerpc such that percpu
symbols are unique and don't clash with local symbols. This serves
two purposes of decreasing the possibility of global percpu symbol
collision and allowing dropping per_cpu__ prefix from percpu symbols.

* arch/powerpc/kernel/perf_callchain.c: s/callchain/cpu_perf_callchain/

* arch/powerpc/kernel/setup-common.c: s/pvr/cpu_pvr/

* arch/powerpc/platforms/pseries/dtl.c: s/dtl/cpu_dtl/

* arch/powerpc/platforms/cell/interrupt.c: s/iic/cpu_iic/

Partly based on Rusty Russell's "alloc_percpu: rename percpu vars
which cause name clashes" patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@ozlabs.org


# ea0f1cab 24-Sep-2009 Rusty Russell <rusty@rustcorp.com.au>

cpumask: Use accessors for cpu_*_mask: powerpc

Use the accessors rather than frobbing bits directly (the new versions
are const).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>


# f063ea02 24-Sep-2009 Rusty Russell <rusty@rustcorp.com.au>

cpumask: arch_send_call_function_ipi_mask: powerpc

We're weaning the core code off handing cpumask's around on-stack.
This introduces arch_send_call_function_ipi_mask(), and by defining
it, the old arch_send_call_function_ipi is defined by the core code.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 757cbd46 08-Sep-2009 Kumar Gala <galak@kernel.crashing.org>

powerpc/85xx: Fix SMP compile error and allow NULL for smp_ops

The following commit introduced a compile error since it removed
the implementation of smp_85xx_basic_setup:

commit 77c0a700c1c292edafa11c1e52821ce4636f81b0
Author: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Fri Aug 28 14:25:04 2009 +1000

powerpc: Properly start decrementer on BookE secondary CPUs

Make it so that smp_ops probe() and setup_cpu() can be set to NULL.

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 67764263 23-Jun-2009 Gautham R Shenoy <ego@in.ibm.com>

powerpc/pseries: Reduce the polling interval in __cpu_up()

Time time taken for a single cpu online operation on a pseries machine
is as follows:
Dedicated LPAR (POWER6): ~220ms.
Shared LPAR (POWER5) : ~240ms.

Of this time, approximately 200ms is taken up by __cpu_up(). This is because
we poll every 200ms to check if the new cpu has notified it's presence
through the cpu_callin_map. We repeat this operation until the new cpu sets
the value in cpu_callin_map or 5 seconds elapse, whichever comes earlier.

However, using completion_structs instead of polling loops,
the time taken by the new processor to indicate it's presence has
found to be less than 1ms on pseries. This method however may not
work on all powerpc platforms due to the time-base synchronization code.

Keeping this in mind, we could reduce msleep polling interval from
200ms to 1ms while retaining the 5 second timeout.

With this, the time taken for a cpu online operation changes as follows:
Dedicated LPAR (POWER6): 20-25ms.
Shared LPAR (POWER5) : 60-80ms.

In both these cases, it was found that the code polls through the loop
only once indicating that 1ms is a reasonable value, atleast on pseries.

The code needs testing on other powerpc platforms.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Joel Schopp <jschopp@austin.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 7ccbe504 18-Jun-2009 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/pmac: Fix issues with PowerMac "PowerSurge" SMP

The old PowerSurge SMP (ie, dual or quad 604 machines) code has
numerous issues in modern world.

One is cpu_possible_map is set too late (the device-tree is bogus)
so we fail to allocate the interrupt stacks and crash. Another
problem is the fact the timebase is frozen by the bringup of the
second CPU so the delays in the generic code will hang, we need
to move some of the calling procedure to inside the powermac code.

This makes it boot again for me

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# b2ea25b9 10-Dec-2008 Nathan Lynch <ntl@pobox.com>

powerpc: Convert cpu_to_l2cache() to of_find_next_cache_node()

The smp code uses cache information to populate cpu_core_map; change
it to use common code for cache lookup.

Signed-off-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 13a9801e 10-Dec-2008 Nathan Lynch <ntl@pobox.com>

powerpc: Move smp_hw_index to 32-bit code

smp_hw_index isn't used on 64-bit, so move it from smp.c to
setup_32.c.

Signed-off-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 98a79d6a 13-Dec-2008 Rusty Russell <rusty@rustcorp.com.au>

cpumask: centralize cpu_online_map and cpu_possible_map

Impact: cleanup

Each SMP arch defines these themselves. Move them to a central
location.

Twists:
1) Some archs (m32, parisc, s390) set possible_map to all 1, so we add a
CONFIG_INIT_ALL_POSSIBLE for this rather than break them.

2) mips and sparc32 '#define cpu_possible_map phys_cpu_present_map'.
Those archs simply have phys_cpu_present_map replaced everywhere.

3) Alpha defined cpu_possible_map to cpu_present_map; this is tricky
so I just manipulate them both in sync.

4) IA64, cris and m32r have gratuitous 'extern cpumask_t cpu_possible_map'
declarations.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Grant Grundler <grundler@parisc-linux.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Mike Travis <travis@sgi.com>
Cc: ink@jurassic.park.msu.ru
Cc: rmk@arm.linux.org.uk
Cc: starvik@axis.com
Cc: tony.luck@intel.com
Cc: takata@linux-m32r.org
Cc: ralf@linux-mips.org
Cc: grundler@parisc-linux.org
Cc: paulus@samba.org
Cc: schwidefsky@de.ibm.com
Cc: lethal@linux-sh.org
Cc: wli@holomorphy.com
Cc: davem@davemloft.net
Cc: jdike@addtoit.com
Cc: mingo@redhat.com


# 25ddd738 14-Nov-2008 Milton Miller <miltonm@bga.com>

powerpc: Provide a separate handler for each IPI action

With the new generic smp call function helpers, I noticed the code in
smp_message_recv was a single function call in many cases. While
getting the message number from the ipi data is easy, we can reduce
the path length by a function and data-dependent switch by registering
seperate IPI actions for these simple calls.

Originally I left the ipi action array exposed, but then I realized the
registration code should be common too.

The three users each had their own name array, so I made a fourth
to convert all users to use a common one.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 22d660ff 09-Oct-2008 Milton Miller <miltonm@bga.com>

powerpc/smp: No need to set_need_resched when getting a resched IPI

The comment in the code was asking "Do we have to do this?", and according
to x86 and s390 the answer is no, the scheduler will do it before calling
the arch hook.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# e545a614 07-Sep-2008 Manfred Spraul <manfred@colorfullife.com>

kernel/cpu.c: create a CPU_STARTING cpu_chain notifier

Right now, there is no notifier that is called on a new cpu, before the new
cpu begins processing interrupts/softirqs.
Various kernel function would need that notification, e.g. kvm works around
by calling smp_call_function_single(), rcu polls cpu_online_map.

The patch adds a CPU_STARTING notification. It also adds a helper function
that sends the message to all cpu_chain handlers.

Tested on x86-64.
All other archs are untested. Especially on sparc, I'm not sure if I got
it right.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# e9efed3b 26-Jul-2008 Nathan Lynch <ntl@pobox.com>

powerpc: Make core id information available to userspace

Existing Open Firmware practice is to report each processor core as a
separate node in the device tree. Report the value of the "reg" OF
property corresponding to a logical CPU's device node as the core_id
attribute in /sys/devices/system/cpu/cpu*/topology/core_id.

Signed-off-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 440a0857 26-Jul-2008 Nathan Lynch <ntl@pobox.com>

powerpc: Make core sibling information available to userspace

Implement the notion of "core siblings" for powerpc. This makes
/sys/devices/system/cpu/cpu*/topology/core_siblings present sensible
values, indicating online CPUs which share an L2 cache.

BenH: Made cpu_to_l2cache() use of_find_node_by_phandle() instead
of IBM-specific open coded search

Signed-off-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# e2075f79 26-Jul-2008 Nathan Lynch <ntl@pobox.com>

powerpc: Update cpu_sibling_maps dynamically

Rather doing one initialization pass over all the per-cpu
cpu_sibling_maps at boot, update the maps at cpu online/offline time.

This is a behavior change -- the thread_siblings attribute now
reflects only online siblings, whereas it would display offline
siblings before. The new behavior matches that of x86, and is
arguably more useful.

Signed-off-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 8691e5a8 06-Jun-2008 Jens Axboe <jens.axboe@oracle.com>

smp_call_function: get rid of the unused nonatomic/retry argument

It's never used and the comments refer to nonatomic and retry
interchangably. So get rid of it.

Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# b7d7a240 26-Jun-2008 Jens Axboe <jens.axboe@oracle.com>

powerpc: convert to generic helpers for IPI function calls

This converts ppc to use the new helpers for smp_call_function() and
friends, and adds support for smp_call_function_single().

ppc loses the timeout functionality of smp_call_function_mask() with
this change, as the generic code does not provide that.

Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 1c21a293 07-May-2008 Michael Ellerman <michael@ellerman.id.au>

[POWERPC] Fix sparse warnings in arch/powerpc/kernel

Make a few things static in lparcfg.c
Make init and exit routines static in rtas_flash.c
Make things static in rtas_pci.c
Make some functions static in rtas.c
Make fops static in rtas-proc.c
Remove unneeded extern for do_gtod in smp.c
Make clocksource_init() static in time.c
Make last_tick_len and ticklen_to_xs static in time.c
Move the declaration of the pvr per-cpu into smp.h
Make kexec_smp_down() and kexec_stack static in machine_kexec_64.c
Don't return void in arch_teardown_msi_irqs() in msi.c
Move declaration of GregorianDay()into asm/time.h

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 3b575064 01-May-2008 Paul Mackerras <paulus@samba.org>

[POWERPC] Bolt in SLB entry for kernel stack on secondary cpus

This fixes a regression reported by Kamalesh Bulabel where a POWER4
machine would crash because of an SLB miss at a point where the SLB
miss exception was unrecoverable. This regression is tracked at:

http://bugzilla.kernel.org/show_bug.cgi?id=10082

SLB misses at such points shouldn't happen because the kernel stack is
the only memory accessed other than things in the first segment of the
linear mapping (which is mapped at all times by entry 0 of the SLB).
The context switch code ensures that SLB entry 2 covers the kernel
stack, if it is not already covered by entry 0. None of entries 0
to 2 are ever replaced by the SLB miss handler.

Where this went wrong is that the context switch code assumes it
doesn't have to write to SLB entry 2 if the new kernel stack is in the
same segment as the old kernel stack, since entry 2 should already be
correct. However, when we start up a secondary cpu, it calls
slb_initialize, which doesn't set up entry 2. This is correct for
the boot cpu, where we will be using a stack in the kernel BSS at this
point (i.e. init_thread_union), but not necessarily for secondary
cpus, whose initial stack can be allocated anywhere. This doesn't
cause any immediate problem since the SLB miss handler will just
create an SLB entry somewhere else to cover the initial stack.

In fact it's possible for the cpu to go quite a long time without SLB
entry 2 being valid. Eventually, though, the entry created by the SLB
miss handler will get overwritten by some other entry, and if the next
access to the stack is at an unrecoverable point, we get the crash.

This fixes the problem by making slb_initialize create a suitable
entry for the kernel stack, if we are on a secondary cpu and the stack
isn't covered by SLB entry 0. This requires initializing the
get_paca()->kstack field earlier, so I do that in smp_create_idle
where the current field is initialized. This also abstracts a bit of
the computation that mk_esid_data in slb.c does so that it can be used
in slb_initialize.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# e057d985 27-Dec-2007 Olof Johansson <olof@lixom.net>

[POWERPC] Make smp_send_stop() handle panic and xmon reboot

smp_send_stop() will send an IPI to all other cpus to shut them down.
However, for the case of xmon-based reboots (as well as potentially some
panics), the other cpus are (or might be) spinning with interrupts off,
and won't take the IPI.

Current code will drop us into the debugger when the IPI fails, which
means we're in an infinite loop that we can't get out of without an
external reset of some sort.

Instead, make the smp_send_stop() IPI call path just print the warning
about being unable to send IPIs, but make it return so the rest of the
shutdown sequence can continue. It's not perfect, but the lesser of
two evils.

Also move the call_lock handling outside of smp_call_function_map so we
can avoid deadlocks in smp_send_stop().

Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# b616de5e 27-Dec-2007 Olof Johansson <olof@lixom.net>

[POWERPC] Make smp_call_function_map static

smp_call_function_map should be static, and for consistency prepend it
with __ like other local helper functions in the same file.

Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# d5a7430d 16-Oct-2007 Mike Travis <travis@sgi.com>

Convert cpu_sibling_map to be a per cpu variable

Convert cpu_sibling_map from a static array sized by NR_CPUS to a per_cpu
variable. This saves sizeof(cpumask_t) * NR unused cpus. Access is mostly
from startup and CPU HOTPLUG functions.

Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d831d0b8 20-Sep-2007 Tony Breeds <tony@bakeyournoodle.com>

[POWERPC] Implement clockevents driver for powerpc

This registers a clock event structure for the decrementer and turns
on CONFIG_GENERIC_CLOCKEVENTS, which means that we now don't need
most of timer_interrupt(), since the work is done in generic code.
For secondary CPUs, their decrementer clockevent is registered when
the CPU comes up (the generic code automatically removes the
clockevent when the CPU goes down).

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 8fd7675c 17-Sep-2007 Satyam Sharma <satyam@infradead.org>

[POWERPC] Avoid pointless WARN_ON(irqs_disabled()) from panic codepath

> ------------[ cut here ]------------
> Badness at arch/powerpc/kernel/smp.c:202

comes when smp_call_function_map() has been called with irqs disabled,
which is illegal. However, there is a special case, the panic() codepath,
when we do not want to warn about this -- warning at that time is pointless
anyway, and only serves to scroll away the *real* cause of the panic and
distracts from the real bug.

* So let's extract the WARN_ON() from smp_call_function_map() into all its
callers -- smp_call_function() and smp_call_function_single()

* Also, introduce another caller of smp_call_function_map(), namely
__smp_call_function() (and make smp_call_function() a wrapper over this)
which does *not* warn about disabled irqs

* Use this __smp_call_function() from the panic codepath's smp_send_stop()

We also end having to move code of smp_send_stop() below the definition
of __smp_call_function().

Signed-off-by: Satyam Sharma <satyam@infradead.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 17aa3a82 31-Jul-2007 Kevin Corry <kevcorry@us.ibm.com>

[POWERPC] Fix num_cpus calculation in smp_call_function_map()

In smp_call_function_map(), num_cpus is set to the number of online
CPUs minus one. However, if the CPU mask does not include all CPUs
(except the one we're running on), the routine will hang in the first
while() loop until the 8 second timeout occurs.

The num_cpus should be set to the number of CPUs specified in the mask
passed into the routine, after we've made any modifications to the
mask. With this change, we can also get rid of the call to
cpus_empty() and avoid adding another pass through the bitmask.

Signed-off-by: Kevin Corry <kevcorry@us.ibm.com>
Signed-off-by: Carl Love <carll@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# adff093d 19-Jul-2007 Avi Kivity <avi@qumranet.com>

[POWERPC] Allow smp_call_function_single() to current cpu

This removes the requirement for callers to get_cpu() to check in simple
cases. i386 and x86_64 already received a similar treatment.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# d3fdaed9 18-May-2007 Hugh Dickins <hugh@veritas.com>

[POWERPC] Fix smp_call_function to be preempt-safe

smp_call_function_map() was not safe against preemption to another
cpu: its test for removing self from map was outside the spinlock.
Rearrange it a little to fix that.

smp_call_function_single() was also wrong: now get_cpu() before
excluding self, as other architectures do.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 44755d11 02-May-2007 will schmidt <will_schmidt@vnet.ibm.com>

[POWERPC] Add smp_call_function_map and smp_call_function_single

Add a new function named smp_call_function_single(). This matches a generic
prototype from include/linux/smp.h.

Add a function smp_call_function_map(). This is, for the most part, a rename
of smp_call_function, with some added cpumask support. smp_call_function and
smp_call_function_single call into smp_call_function_map.

Lightly tested on 970mp (blade), power4 and power5.

Signed-off-by: Will Schmidt <will_schmidt@vnet.ibm.com>
cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# a741e679 10-Apr-2007 Benjamin Herrenschmidt <benh@kernel.crashing.org>

[POWERPC] Make tlb flush batch use lazy MMU mode

The current tlb flush code on powerpc 64 bits has a subtle race since we
lost the page table lock due to the possible faulting in of new PTEs
after a previous one has been removed but before the corresponding hash
entry has been evicted, which can leads to all sort of fatal problems.

This patch reworks the batch code completely. It doesn't use the mmu_gather
stuff anymore. Instead, we use the lazy mmu hooks that were added by the
paravirt code. They have the nice property that the enter/leave lazy mmu
mode pair is always fully contained by the PTE lock for a given range
of PTEs. Thus we can guarantee that all batches are flushed on a given
CPU before it drops that lock.

We also generalize batching for any PTE update that require a flush.

Batching is now enabled on a CPU by arch_enter_lazy_mmu_mode() and
disabled by arch_leave_lazy_mmu_mode(). The code epects that this is
always contained within a PTE lock section so no preemption can happen
and no PTE insertion in that range from another CPU. When batching
is enabled on a CPU, every PTE updates that need a hash flush will
use the batch for that flush.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 775aeff4 08-Feb-2007 Michael Ellerman <michael@ellerman.id.au>

[POWERPC] Move MPIC smp routines into mpic.c

Move a couple of MPIC smp routines into mpic.c, they're inside an SMP
block in mpic.c - so they're still only built for SMP.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# b282b6f8 11-Jan-2007 Gautham R Shenoy <ego@in.ibm.com>

[PATCH] Change cpu_up and co from __devinit to __cpuinit

Compiling the kernel with CONFIG_HOTPLUG = y and CONFIG_HOTPLUG_CPU = n
with CONFIG_RELOCATABLE = y generates the following modpost warnings

WARNING: vmlinux - Section mismatch: reference to .init.data: from
.text between '_cpu_up' (at offset 0xc0141b7d) and 'cpu_up'
WARNING: vmlinux - Section mismatch: reference to .init.data: from
.text between '_cpu_up' (at offset 0xc0141b9c) and 'cpu_up'
WARNING: vmlinux - Section mismatch: reference to .init.text:__cpu_up
from .text between '_cpu_up' (at offset 0xc0141bd8) and 'cpu_up'
WARNING: vmlinux - Section mismatch: reference to .init.data: from
.text between '_cpu_up' (at offset 0xc0141c05) and 'cpu_up'
WARNING: vmlinux - Section mismatch: reference to .init.data: from
.text between '_cpu_up' (at offset 0xc0141c26) and 'cpu_up'
WARNING: vmlinux - Section mismatch: reference to .init.data: from
.text between '_cpu_up' (at offset 0xc0141c37) and 'cpu_up'

This is because cpu_up, _cpu_up and __cpu_up (in some architectures) are
defined as __devinit
AND
__cpu_up calls some __cpuinit functions.

Since __cpuinit would map to __init with this kind of a configuration,
we get a .text refering .init.data warning.

This patch solves the problem by converting all of __cpu_up, _cpu_up
and cpu_up from __devinit to __cpuinit. The approach is justified since
the callers of cpu_up are either dependent on CONFIG_HOTPLUG_CPU or
are of __init type.

Thus when CONFIG_HOTPLUG_CPU=y, all these cpu up functions would land up
in .text section, and when CONFIG_HOTPLUG_CPU=n, all these functions would
land up in .init section.

Tested on a i386 SMP machine running linux-2.6.20-rc3-mm1.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 36ca4ba4 24-Oct-2006 Christian Krafft <krafft@de.ibm.com>

[POWERPC] cell: add cpufreq driver for Cell BE processor

This patch adds a cpufreq backend driver to enable frequency scaling on cell.

Signed-off-by: Christian Krafft <krafft@de.ibm.com>
Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 7d12e780 05-Oct-2006 David Howells <dhowells@redhat.com>

IRQ: Maintain regs pointer globally rather than passing to IRQ handlers

Maintain a per-CPU global "struct pt_regs *" variable which can be used instead
of passing regs around manually through all ~1800 interrupt handlers in the
Linux kernel.

The regs pointer is used in few places, but it potentially costs both stack
space and code to pass it around. On the FRV arch, removing the regs parameter
from all the genirq function results in a 20% speed up of the IRQ exit path
(ie: from leaving timer_interrupt() to leaving do_IRQ()).

Where appropriate, an arch may override the generic storage facility and do
something different with the variable. On FRV, for instance, the address is
maintained in GR28 at all times inside the kernel as part of general exception
handling.

Having looked over the code, it appears that the parameter may be handed down
through up to twenty or so layers of functions. Consider a USB character
device attached to a USB hub, attached to a USB controller that posts its
interrupts through a cascaded auxiliary interrupt controller. A character
device driver may want to pass regs to the sysrq handler through the input
layer which adds another few layers of parameter passing.

I've build this code with allyesconfig for x86_64 and i386. I've runtested the
main part of the code on FRV and i386, though I can't test most of the drivers.
I've also done partial conversion for powerpc and MIPS - these at least compile
with minimal configurations.

This will affect all archs. Mostly the changes should be relatively easy.
Take do_IRQ(), store the regs pointer at the beginning, saving the old one:

struct pt_regs *old_regs = set_irq_regs(regs);

And put the old one back at the end:

set_irq_regs(old_regs);

Don't pass regs through to generic_handle_irq() or __do_IRQ().

In timer_interrupt(), this sort of change will be necessary:

- update_process_times(user_mode(regs));
- profile_tick(CPU_PROFILING, regs);
+ update_process_times(user_mode(get_irq_regs()));
+ profile_tick(CPU_PROFILING);

I'd like to move update_process_times()'s use of get_irq_regs() into itself,
except that i386, alone of the archs, uses something other than user_mode().

Some notes on the interrupt handling in the drivers:

(*) input_dev() is now gone entirely. The regs pointer is no longer stored in
the input_dev struct.

(*) finish_unlinks() in drivers/usb/host/ohci-q.c needs checking. It does
something different depending on whether it's been supplied with a regs
pointer or not.

(*) Various IRQ handler function pointers have been moved to type
irq_handler_t.

Signed-Off-By: David Howells <dhowells@redhat.com>
(cherry picked from 1b16e7ac850969f38b375e511e3fa2f474a33867 commit)


# 8cffc6ac 03-Jul-2006 Benjamin Herrenschmidt <benh@kernel.crashing.org>

[POWERPC] Fix non-MPIC CHRPs with CONFIG_SMP set

Pseudo-CHRP machines like Pegasos without an MPIC would crash at boot if
CONFIG_SMP was set because the "smp_ops" pointer was set to MPIC related
ops unconditionally. This patch makes it NULL on machines that don't
support SMP and provides proper default behaviour in the callers when
smp_ops is NULL.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 1e031d65 03-Jul-2006 Benjamin Herrenschmidt <benh@kernel.crashing.org>

[POWERPC] Fix non-MPIC CHRPs with CONFIG_SMP set

Pseudo-CHRP machines like Pegasos without an MPIC would crash at boot if
CONFIG_SMP was set because the "smp_ops" pointer was set to MPIC related
ops unconditionally. This patch makes it NULL on machines that don't
support SMP and provides proper default behaviour in the callers when
smp_ops is NULL.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 6ab3d562 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de>

Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# ee0339f2 17-Jun-2006 Jon Loeliger <jdl@jdl.com>

[POWERPC] Add starting of secondary 86xx CPUs.

Clear the high BATS during load_up_mmu if FTR_HAS_HIGH_BATS.
Allow just a bit more time for secondary CPUs to phone home.

Signed-off-by: Wei Zhang <Wei.Zhang@freescale.com>
Signed-off-by: Haiying Wang <Haiying.Wang@freescale.com>
Signed-off-by: Jon Loeliger <jdl@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 0e551954 28-Mar-2006 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

[PATCH] for_each_possible_cpu: powerpc

for_each_cpu() actually iterates across all possible CPUs. We've had mistakes
in the past where people were using for_each_cpu() where they should have been
iterating across only online or present CPUs. This is inefficient and
possibly buggy.

We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this in the
future.

This patch replaces for_each_cpu with for_each_possible_cpu.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# c6622f63 23-Feb-2006 Paul Mackerras <paulus@samba.org>

powerpc: Implement accurate task and CPU time accounting

This implements accurate task and cpu time accounting for 64-bit
powerpc kernels. Instead of accounting a whole jiffy of time to a
task on a timer interrupt because that task happened to be running at
the time, we now account time in units of timebase ticks according to
the actual time spent by the task in user mode and kernel mode. We
also count the time spent processing hardware and software interrupts
accurately. This is conditional on CONFIG_VIRT_CPU_ACCOUNTING. If
that is not set, we do tick-based approximate accounting as before.

To get this accurate information, we read either the PURR (processor
utilization of resources register) on POWER5 machines, or the timebase
on other machines on

* each entry to the kernel from usermode
* each exit to usermode
* transitions between process context, hard irq context and soft irq
context in kernel mode
* context switches.

On POWER5 systems with shared-processor logical partitioning we also
read both the PURR and the timebase at each timer interrupt and
context switch in order to determine how much time has been taken by
the hypervisor to run other partitions ("steal" time). Unfortunately,
since we need values of the PURR on both threads at the same time to
accurately calculate the steal time, and since we can only calculate
steal time on a per-core basis, the apportioning of the steal time
between idle time (time which we ceded to the hypervisor in the idle
loop) and actual stolen time is somewhat approximate at the moment.

This is all based quite heavily on what s390 does, and it uses the
generic interfaces that were added by the s390 developers,
i.e. account_system_time(), account_user_time(), etc.

This patch doesn't add any new interfaces between the kernel and
userspace, and doesn't change the units in which time is reported to
userspace by things such as /proc/stat, /proc/<pid>/stat, getrusage(),
times(), etc. Internally the various task and cpu times are stored in
timebase units, but they are converted to USER_HZ units (1/100th of a
second) when reported to userspace. Some precision is therefore lost
but there should not be any accumulating error, since the internal
accumulation is at full precision.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# 7d4d6154 06-Feb-2006 Nathan Lynch <ntl@pobox.com>

[PATCH] powerpc: avoid timer interrupt replay effect when onlining cpu

When a cpu is hotplug-onlined, if we don't set per_cpu(last_jiffy) to
something sane, timer_interrupt will execute its while loop for every
tick missed since the cpu was last online (or since the system was
booted, if we're adding a new cpu). This can cause weird hangs, ssh
sessions dropping, and we can even go xmon if we take a global IPI at
the wrong time.

Signed-off-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# b5e2fc1c 12-Jan-2006 Al Viro <viro@ftp.linux.org.uk>

[PATCH] powerpc: task_thread_info()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 4b703a23 12-Dec-2005 Anton Blanchard <anton@samba.org>

[PATCH] ppc64: Add NUMA cpu summary at boot

We used to print a NUMA cpu summary at boot before the hotplug cpu code
was added. This has been useful for catching machine configuration as
well as firmware bugs in the past.

This patch restores that functionality. An example of the output is:

Node 0 CPUs: 0-7
Node 1 CPUs: 8-15

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# cc532915 04-Dec-2005 Michael Ellerman <michael@ellerman.id.au>

[PATCH] powerpc: Add arch dependent basic infrastructure for Kdump.

Implementing the machine_crash_shutdown which will be called by
crash_kexec (called in case of a panic, sysrq etc.). Disable the
interrupts, shootdown cpus using debugger IPI and collect regs
for all CPUs.

elfcorehdr= specifies the location of elf core header stored by
the crashed kernel. This command line option will be passed by
the kexec-tools to capture kernel.

savemaxmem= specifies the actual memory size that the first kernel
has and this value will be used for dumping in the capture kernel.
This command line option will be passed by the kexec-tools to
capture kernel.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 404849bb 23-Nov-2005 David Gibson <david@gibson.dropbear.id.au>

[PATCH] powerpc: Remove some unneeded fields from the paca

This patch removes several unnecessary fields from the paca:

- next_jiffy_update_tb was simply unused. Remove trivially.

- The exdsi exception save area was not used. There were plans to use
it, but they never seem to have gone anywhere. If they ever do, we
can put it back. Remove from the paca, and from asm-offsets.c

- The default_decr field was used from asm, but was only ever assigned
the value of tb_ticks_per_jiffy. Just access tb_ticks_per_jiffy from
asm directly instead.

Built and booted on POWER5 LPAR and iSeries RS64.

Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# f9e4ec57 14-Nov-2005 Michael Ellerman <michael@ellerman.id.au>

[PATCH] powerpc: More debugging fixups

Add a few more missing includes of udbg.h

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# a7f290da 11-Nov-2005 Benjamin Herrenschmidt <benh@kernel.crashing.org>

[PATCH] powerpc: Merge vdso's and add vdso support to 32 bits kernel

This patch moves the vdso's to arch/powerpc, adds support for the 32
bits vdso to the 32 bits kernel, rename systemcfg (finally !), and adds
some new (still untested) routines to both vdso's: clock_gettime() with
support for CLOCK_REALTIME and CLOCK_MONOTONIC, clock_getres() (same
clocks) and get_tbfreq() for glibc to retreive the timebase frequency.

Tom,Steve: The implementation of get_tbfreq() I've done for 32 bits
returns a long long (r3, r4) not a long. This is such that if we ever
add support for >4Ghz timebases on ppc32, the userland interface won't
have to change.

I have tested gettimeofday() using some glibc patches in both ppc32 and
ppc64 kernels using 32 bits userland (I haven't had a chance to test a
64 bits userland yet, but the implementation didn't change and was
tested earlier). I haven't tested yet the new functions.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 094fe2e7 09-Nov-2005 Paul Mackerras <paulus@samba.org>

powerpc: Fixes for 32-bit powermac SMP

A couple of bugs crept in with the merge of smp.c...

Signed-off-by: Paul Mackerras <paulus@samba.org>


# 799d6046 09-Nov-2005 Paul Mackerras <paulus@samba.org>

[PATCH] powerpc: merge code values for identifying platforms

This patch merges platform codes. systemcfg->platform is no longer used,
systemcfg use in general is deprecated as much as possible (and renamed
_systemcfg before it gets completely moved elsewhere in a future patch),
_machine is now used on ppc64 along as ppc32. Platform codes aren't gone
yet but we are getting a step closer. A bunch of asm code in head[_64].S
is also turned into C code.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# e4d76e1c 09-Nov-2005 Andrew Morton <akpm@osdl.org>

[PATCH] powerpc: sched fixups

- Re-add a hunk lost during merge: ppc64 is missing the hunk that disables
preempt on the secondary CPUs before they call cpu_idle().

- ppc's cpu_idle() had the need_resched() test wrong.

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 570142ca 07-Nov-2005 Anton Blanchard <anton@samba.org>

[PATCH] ppc64: remove some direct xmon calls

Even though we can enable and disable xmon at runtime now, there are a
few places in the merge tree that call xmon and xmon_printf directly.

In the case below we call die() which will call xmon if it is enabled.

Also remove an unnecessary include of xmon.h in smp.c.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 5ad57078 04-Nov-2005 Paul Mackerras <paulus@samba.org>

powerpc: Merge smp.c and smp.h

This also moves setup_cpu_maps to setup-common.c (calling it
smp_setup_cpu_maps) and uses it on both 32-bit and 64-bit.

Signed-off-by: Paul Mackerras <paulus@samba.org>