History log of /linux-master/arch/powerpc/mm/numa.c
Revision Date Author Comments
# 2a066ae1 14-Dec-2023 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc: Stop using of_root

Replace all usages of of_root by of_find_node_by_path("/")

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231214103152.12269-5-mpe@ellerman.id.au


# c040c748 22-Aug-2023 Michael Ellerman <mpe@ellerman.id.au>

powerpc/pseries: Move VPHN constants into vphn.h

These don't have any particularly good reason to belong in lppaca.h,
move them into their own header.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230823055317.751786-1-mpe@ellerman.id.au


# 2500763d 29-Mar-2023 Rob Herring <robh@kernel.org>

powerpc: Use of_address_to_resource()

Replace open coded reading of "reg" or of_get_address()/
of_translate_address() calls with a single call to
of_address_to_resource().

Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230329220337.141295-1-robh@kernel.org


# b277fc79 03-Apr-2023 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/papr_scm: Update the NUMA distance table for the target node

Platform device helper routines won't update the NUMA distance table
while creating a platform device, even if the device is present on a
NUMA node that doesn't have memory or CPU. This is especially true for
pmem devices. If the target node of the pmem device is not online, we
find the nearest online node to the device and associate the pmem device
with that online node. To find the nearest online node, we should have
the numa distance table updated correctly. Update the distance
information during the device probe.

For a papr scm device on NUMA node 3 distance_lookup_table value for
distance_ref_points_depth = 2 before and after fix is below:

Before fix:
node 3 distance depth 0 - 0
node 3 distance depth 1 - 0
node 4 distance depth 0 - 4
node 4 distance depth 1 - 2
node 5 distance depth 0 - 5
node 5 distance depth 1 - 1

After fix
node 3 distance depth 0 - 3
node 3 distance depth 1 - 1
node 4 distance depth 0 - 4
node 4 distance depth 1 - 2
node 5 distance depth 0 - 5
node 5 distance depth 1 - 1

Without the fix, the nearest numa node to the pmem device (NUMA node 3)
will be picked as 4. After the fix, we get the correct numa node which
is 5.

Fixes: da1115fdbd6e ("powerpc/nvdimm: Pick nearby online node if the device node is not online")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230404041433.1781804-1-aneesh.kumar@linux.ibm.com


# 7b31f7da 03-Jul-2022 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/mm: Always update max/min_low_pfn in mem_topology_setup()

For both CONFIG_NUMA enabled/disabled use mem_topology_setup() to
update max/min_low_pfn.

This also adds min_low_pfn update to CONFIG_NUMA which was initialized
to zero before. (mpe: Though MEMORY_START is == 0 for PPC64=y which is
all possible NUMA=y systems)

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220704063851.295482-1-aneesh.kumar@linux.ibm.com


# 48b63961 11-Apr-2022 Oscar Salvador <osalvador@suse.de>

powerpc/numa: Associate numa node to its cpu earlier

powerpc is the only platform that do not rely on
cpu_up()->try_online_node() to bring up a numa node,
and special cases it, instead, deep in its own machinery:

dlpar_online_cpu
find_and_online_cpu_nid
try_online_node

This should not be needed, but the thing is that the try_online_node()
from cpu_up() will not apply on the right node, because cpu_to_node()
will return the old mapping numa<->cpu that gets set on boot stage
for all possible cpus.

That can be seen easily if we try to print out the numa node passed
to try_online_node() in cpu_up().

The thing is that the numa<->cpu mapping does not get updated till a much
later stage in start_secondary:

start_secondary:
set_numa_node(numa_cpu_lookup_table[cpu])

But we do not really care, as we already now the
CPU <-> NUMA associativity back in find_and_online_cpu_nid(),
so let us make use of that and set the proper numa<->cpu mapping,
so cpu_to_node() in cpu_up() returns the right node and
try_online_node() can do its work.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220411074934.4632-1-osalvador@suse.de


# 86c38fec 08-Mar-2022 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc: Remove asm/prom.h from all files that don't need it

Several files include asm/prom.h for no reason.

Clean it up.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[mpe: Drop change to prom_parse.c as reported by lkp@intel.com]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/7c9b8fda63dcf63e1b28f43e7ebdb95182cbc286.1646767214.git.christophe.leroy@csgroup.eu


# e62520b8 11-Jul-2021 Dwaipayan Ray <dwaipayanray1@gmail.com>

powerpc/mm: Switch from __FUNCTION__ to __func__

__FUNCTION__ exists only for backwards compatibility reasons with old
gcc versions. Replace it with __func__.

Signed-off-by: Dwaipayan Ray <dwaipayanray1@gmail.com>
[chleroy: Fixed parenthesis alignment]
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210711084536.95394-1-dwaipayanray1@gmail.com


# e4ff7759 30-Mar-2022 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Handle partially initialized numa nodes

With commit 09f49dca570a ("mm: handle uninitialized numa nodes
gracefully") NODE_DATA for even a memoryless/cpuless node is partially
initialized at boot time.

Before onlining the node, current Powerpc code checks for NODE_DATA to
be NULL. However since NODE_DATA is partially initialized, this check
will end up always being false.

This causes hotplugging a CPU to a memoryless/cpuless node to fail.

Before adding CPUs:
$ numactl -H
available: 1 nodes (4)
node 4 cpus: 0 1 2 3 4 5 6 7
node 4 size: 97372 MB
node 4 free: 95545 MB
node distances:
node 4
4: 10

$ lparstat
System Configuration
type=Dedicated mode=Capped smt=8 lcpu=1 mem=99709440 kB cpus=0 ent=1.00

%user %sys %wait %idle physc %entc lbusy app vcsw phint
----- ----- ----- ----- ----- ----- ----- ----- ----- -----
2.66 2.67 0.16 94.51 0.00 0.00 5.33 0.00 67749 0

After hotplugging 32 cores:
$ numactl -H
node 4 cpus: 0 1 2 3 4 5 6 7 120 121 122 123 124 125 126 127 128 129 130
131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148
149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
167 168 169 170 171 172 173 174 175
node 4 size: 97372 MB
node 4 free: 93636 MB
node distances:
node 4
4: 10

$ lparstat
System Configuration
type=Dedicated mode=Capped smt=8 lcpu=33 mem=99709440 kB cpus=0 ent=33.00

%user %sys %wait %idle physc %entc lbusy app vcsw phint
----- ----- ----- ----- ----- ----- ----- ----- ----- -----
0.04 0.02 0.00 99.94 0.00 0.00 0.06 0.00 1128751 3

As we can see numactl is listing only 8 cores while lparstat is showing
33 cores.

Also dmesg is showing messages like:
[ 2261.318350 ] BUG: arch topology borken
[ 2261.318357 ] the DIE domain not a subset of the NODE domain

Fixes: 09f49dca570a ("mm: handle uninitialized numa nodes gracefully")
Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220330135123.1868197-1-srikar@linux.vnet.ibm.com


# 749ed4a2 24-Feb-2022 Daniel Henrique Barboza <danielhb413@gmail.com>

powerpc/mm/numa: skip NUMA_NO_NODE onlining in parse_numa_properties()

Executing node_set_online() when nid = NUMA_NO_NODE results in an
undefined behavior. node_set_online() will call node_set_state(), into
__node_set(), into set_bit(), and since NUMA_NO_NODE is -1 we'll end up
doing a negative shift operation inside
arch/powerpc/include/asm/bitops.h. This potential UB was detected
running a kernel with CONFIG_UBSAN.

The behavior was introduced by commit 10f78fd0dabb ("powerpc/numa: Fix a
regression on memoryless node 0"), where the check for nid > 0 was
removed to fix a problem that was happening with nid = 0, but the result
is that now we're trying to online NUMA_NO_NODE nids as well.

Checking for nid >= 0 will allow node 0 to be onlined while avoiding
this UB with NUMA_NO_NODE.

Fixes: 10f78fd0dabb ("powerpc/numa: Fix a regression on memoryless node 0")
Reported-by: Ping Fang <pifang@redhat.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220224182312.1012527-1-danielhb413@gmail.com


# c13f2b2b 16-Dec-2021 Nick Child <nick.child@ibm.com>

powerpc/mm: Add __init attribute to eligible functions

Some functions defined in 'arch/powerpc/mm' are deserving of an
`__init` macro attribute. These functions are only called by other
initialization functions and therefore should inherit the attribute.
Also, change function declarations in header files to include `__init`.

Signed-off-by: Nick Child <nick.child@ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211216220035.605465-4-nick.child@ibm.com


# 30203946 08-Nov-2021 Nicholas Piggin <npiggin@gmail.com>

powerpc/pseries: Fix numa FORM2 parsing fallback code

In case the FORM2 distance table from firmware is not the expected size,
there is fallback code that just populates the lookup table as local vs
remote.

However it then continues on to use the distance table. Fix.

Fixes: 1c6b5a7e7405 ("powerpc/pseries: Add support for FORM2 associativity")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211109064900.2041386-2-npiggin@gmail.com


# 0bd81274 08-Nov-2021 Nicholas Piggin <npiggin@gmail.com>

powerpc/pseries: rename numa_dist_table to form2_distances

The name of the local variable holding the "form2" property address
conflicts with the numa_distance_table global.

This patch does 's/numa_dist_table/form2_distances/g' over the function,
which also renames numa_dist_table_length to form2_distances_length.

Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211109064900.2041386-1-npiggin@gmail.com


# 9a245d0e 26-Aug-2021 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Update cpu_cpu_map on CPU online/offline

cpu_cpu_map holds all the CPUs in the DIE. However in PowerPC, when
onlining/offlining of CPUs, this mask doesn't get updated. This mask
is however updated when CPUs are added/removed. So when both
operations like online/offline of CPUs and adding/removing of CPUs are
done simultaneously, then cpumaps end up broken.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898
build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag
udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag
bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
rfkill nf_tables nfnetlink pseries_rng xts vmx_crypto uio_pdrv_genirq
uio binfmt_misc ip_tables xfs libcrc32c dm_service_time sd_mod t10_pi sg
ibmvfc scsi_transport_fc ibmveth dm_multipath dm_mirror dm_region_hash
dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP: c0000000001caac8 LR: c0000000001caac4 CTR: 00000000007088ec
REGS: c00000005596f220 TRAP: 0700 Not tainted (5.13.0-rc6+)
MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 48828222 XER:
00000009
CFAR: c0000000001ea698 IRQMASK: 0
GPR00: c0000000001caac4 c00000005596f4c0 c000000001c4a400 0000000000000036
GPR04: 00000000fffdffff c00000005596f1d0 0000000000000027 c0000018cfd07f90
GPR08: 0000000000000023 0000000000000001 0000000000000027 c0000018fe68ffe8
GPR12: 0000000000008000 c00000001e9d1880 c00000013a047200 0000000000000800
GPR16: c000000001d3c7d0 0000000000000240 0000000000000048 c000000010aacd18
GPR20: 0000000000000001 c000000010aacc18 c00000013a047c00 c000000139ec2400
GPR24: 0000000000000280 c000000139ec2520 c000000136c1b400 c000000001c93060
GPR28: c00000013a047c20 c000000001d3c6c0 c000000001c978a0 000000000000000d
NIP [c0000000001caac8] build_sched_domains+0xd48/0x1720
LR [c0000000001caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c00000005596f4c0] [c0000000001caac4] build_sched_domains+0xd44/0x1720 (unreliable)
[c00000005596f670] [c0000000001cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c00000005596f710] [c0000000002804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c00000005596f810] [c000000000283e60] rebuild_sched_domains+0x40/0x70
[c00000005596f840] [c000000000284124] cpuset_hotplug_workfn+0x294/0xf10
[c00000005596fc60] [c000000000175040] process_one_work+0x290/0x590
[c00000005596fd00] [c0000000001753c8] worker_thread+0x88/0x620
[c00000005596fda0] [c000000000181704] kthread+0x194/0x1a0
[c00000005596fe10] [c00000000000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 60000000 2fa30800 409e0028 80fe0000 e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 60000000 <0fe00000> 39000000 38e00000 38c00000

Fix this by updating cpu_cpu_map aka cpumask_of_node() on every CPU
online/offline.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210826100521.412639-5-srikar@linux.vnet.ibm.com


# 544a09ee 26-Aug-2021 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Print debug statements only when required

Currently, a debug message gets printed every time an attempt to
add(remove) a CPU. However this is redundant if the CPU is already added
(removed) from the node.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210826100521.412639-4-srikar@linux.vnet.ibm.com


# 506c2075 26-Aug-2021 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: convert printk to pr_xxx

Convert the remaining printk to pr_xxx
One advantage would be all prints will now have prefix "numa:" from
pr_fmt().

[ convert printk(KERN_ERR) to pr_warn : Suggested by Laurent Dufour ]
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[mpe: Rebase onto powerpc/next, s/WARNING/Warning/]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210826100521.412639-3-srikar@linux.vnet.ibm.com


# 544af642 26-Aug-2021 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Drop dbg in favour of pr_debug

powerpc supported numa=debug which is not documented. This option was
used to print early debug output. However something more flexible can be
achieved by using CONFIG_DYNAMIC_DEBUG.

Hence drop dbg (and numa=debug) in favour of pr_debug

Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[mpe: Rebase on to powerpc/next form2 affinity changes]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210826100521.412639-2-srikar@linux.vnet.ibm.com


# 1c6b5a7e 12-Aug-2021 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/pseries: Add support for FORM2 associativity

PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132223.225214-6-aneesh.kumar@linux.ibm.com


# ef31cb83 12-Aug-2021 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/pseries: Add a helper for form1 cpu distance

This helper is only used with the dispatch trace log collection.
A later patch will add Form2 affinity support and this change helps
in keeping that simpler. Also add a comment explaining we don't expect
the code to be called with FORM0

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132223.225214-5-aneesh.kumar@linux.ibm.com


# 8ddc6448 12-Aug-2021 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/pseries: Consolidate different NUMA distance update code paths

The associativity details of the newly added resourced are collected from
the hypervisor via "ibm,configure-connector" rtas call. Update the numa
distance details of the newly added numa node after the above call.

Instead of updating NUMA distance every time we lookup a node id
from the associativity property, add helpers that can be used
during boot which does this only once. Also remove the distance
update from node id lookup helpers.

Currently, we duplicate parsing code for ibm,associativity and
ibm,associativity-lookup-arrays in the kernel. The associativity array provided
by these device tree properties are very similar and hence can use
a helper to parse the node id and numa distance details.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132223.225214-4-aneesh.kumar@linux.ibm.com


# 0eacd06b 12-Aug-2021 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY

Also make related code cleanup that will allow adding FORM2_AFFINITY in
later patches. No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132223.225214-3-aneesh.kumar@linux.ibm.com


# 7e35ef66 12-Aug-2021 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/pseries: rename min_common_depth to primary_domain_index

No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132223.225214-2-aneesh.kumar@linux.ibm.com


# 9c7248bb 11-May-2021 Laurent Dufour <ldufour@linux.ibm.com>

powerpc/numa: Consider the max NUMA node for migratable LPAR

When a LPAR is migratable, we should consider the maximum possible NUMA
node instead of the number of NUMA nodes from the actual system.

The DT property 'ibm,current-associativity-domains' defines the maximum
number of nodes the LPAR can see when running on that box. But if the
LPAR is being migrated on another box, it may see up to the nodes
defined by 'ibm,max-associativity-domains'. So if a LPAR is migratable,
that value should be used.

Unfortunately, there is no easy way to know if an LPAR is migratable or
not. The hypervisor exports the property 'ibm,migratable-partition' in
the case it set to migrate partition, but that would not mean that the
current partition is migratable.

Without this patch, when a LPAR is started on a 2 node box and then
migrated to a 3 node box, the hypervisor may spread the LPAR's CPUs on
the 3rd node. In that case if a CPU from that 3rd node is added to the
LPAR, it will be wrongly assigned to the node because the kernel has
been set to use up to 2 nodes (the configuration of the departure node).
With this patch applies, the CPU is correctly added to the 3rd node.

Fixes: f9f130ff2ec9 ("powerpc/numa: Detect support for coregroup")
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210511073136.17795-1-ldufour@linux.ibm.com


# 10f78fd0 26-Nov-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Fix a regression on memoryless node 0

Commit e75130f20b1f ("powerpc/numa: Offline memoryless cpuless node 0")
offlines node 0 and expects nodes to be subsequently onlined when CPUs
or nodes are detected.

Commit 6398eaa26816 ("powerpc/numa: Prefer node id queried from vphn")
skips onlining node 0 when CPUs are associated with node 0.

On systems with node 0 having CPUs but no memory, this causes node 0 be
marked offline. This causes issues at boot time when trying to set
memory node for online CPUs while building the zonelist.

0:mon> t
[link register ] c000000000400354 __build_all_zonelists+0x164/0x280
[c00000000161bda0] c0000000016533c8 node_states+0x20/0xa0 (unreliable)
[c00000000161bdc0] c000000000400384 __build_all_zonelists+0x194/0x280
[c00000000161be30] c000000001041800 build_all_zonelists_init+0x4c/0x118
[c00000000161be80] c0000000004020d0 build_all_zonelists+0x190/0x1b0
[c00000000161bef0] c000000001003cf8 start_kernel+0x18c/0x6a8
[c00000000161bf90] c00000000000adb4 start_here_common+0x1c/0x3e8
0:mon> r
R00 = c000000000400354 R16 = 000000000b57a0e8
R01 = c00000000161bda0 R17 = 000000000b57a6b0
R02 = c00000000161ce00 R18 = 000000000b5afee8
R03 = 0000000000000000 R19 = 000000000b6448a0
R04 = 0000000000000000 R20 = fffffffffffffffd
R05 = 0000000000000000 R21 = 0000000001400000
R06 = 0000000000000000 R22 = 000000001ec00000
R07 = 0000000000000001 R23 = c000000001175580
R08 = 0000000000000000 R24 = c000000001651ed8
R09 = c0000000017e84d8 R25 = c000000001652480
R10 = 0000000000000000 R26 = c000000001175584
R11 = c000000c7fac0d10 R27 = c0000000019568d0
R12 = c000000000400180 R28 = 0000000000000000
R13 = c000000002200000 R29 = c00000000164dd78
R14 = 000000000b579f78 R30 = 0000000000000000
R15 = 000000000b57a2b8 R31 = c000000001175584
pc = c000000000400194 local_memory_node+0x24/0x80
cfar= c000000000074334 mcount+0xc/0x10
lr = c000000000400354 __build_all_zonelists+0x164/0x280
msr = 8000000002001033 cr = 44002284
ctr = c000000000400180 xer = 0000000000000001 trap = 380
dar = 0000000000001388 dsisr = c00000000161bc90
0:mon>

Fix this by setting node to be online while onlining CPUs that belong to
node 0.

Fixes: e75130f20b1f ("powerpc/numa: Offline memoryless cpuless node 0")
Fixes: 6398eaa26816 ("powerpc/numa: Prefer node id queried from vphn")
Reported-by: Milan Mohanty <milmohan@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201127053738.10085-1-srikar@linux.vnet.ibm.com


# c9118e6c 13-Oct-2020 Mike Rapoport <rppt@kernel.org>

arch, mm: replace for_each_memblock() with for_each_mem_pfn_range()

There are several occurrences of the following pattern:

for_each_memblock(memory, reg) {
start_pfn = memblock_region_memory_base_pfn(reg);
end_pfn = memblock_region_memory_end_pfn(reg);

/* do something with start_pfn and end_pfn */
}

Rather than iterate over all memblock.memory regions and each time query
for their start and end PFNs, use for_each_mem_pfn_range() iterator to get
simpler and clearer code.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> [.clang-format]
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: https://lkml.kernel.org/r/20200818151634.14343-12-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 72cdd117 16-Sep-2020 Scott Cheloha <cheloha@linux.ibm.com>

pseries/hotplug-memory: hot-add: skip redundant LMB lookup

During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
to determine which node id (nid) to use when later calling __add_memory().

This is wasteful. On pseries, memory_add_physaddr_to_nid() finds an
appropriate nid for a given address by looking up the LMB containing the
address and then passing that LMB to of_drconf_to_nid_single() to get the
nid. In dlpar_add_lmb() we get this address from the LMB itself.

In short, we have a pointer to an LMB and then we are searching for
that LMB *again* in order to find its nid.

If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
can skip the redundant lookup. The only error handling we need to
duplicate from memory_add_physaddr_to_nid() is the fallback to the
default nid when drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or
an invalid nid.

Skipping the extra lookup makes hot-add operations faster, especially
on machines with many LMBs.

Consider an LPAR with 126976 LMBs. In one test, hot-adding 126000
LMBs on an upatched kernel took ~3.5 hours while a patched kernel
completed the same operation in ~2 hours:

Unpatched (12450 seconds):
Sep 9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
Sep 9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
[...]
Sep 9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added

Patched (7065 seconds):
Sep 8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
Sep 8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
[...]
Sep 8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added

It should be noted that the speedup grows more substantial when
hot-adding LMBs at the end of the drconf range. This is because we
are skipping a linear LMB search.

To see the distinction, consider smaller hot-add test on the same
LPAR. A perf-stat run with 10 iterations showed that hot-adding 4096
LMBs completed less than 1 second faster on a patched kernel:

Unpatched:
Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):

104,753.42 msec task-clock # 0.992 CPUs utilized ( +- 0.55% )
4,708 context-switches # 0.045 K/sec ( +- 0.69% )
2,444 cpu-migrations # 0.023 K/sec ( +- 1.25% )
394 page-faults # 0.004 K/sec ( +- 0.22% )
445,902,503,057 cycles # 4.257 GHz ( +- 0.55% ) (66.67%)
8,558,376,740 stalled-cycles-frontend # 1.92% frontend cycles idle ( +- 0.88% ) (49.99%)
300,346,181,651 stalled-cycles-backend # 67.36% backend cycles idle ( +- 0.76% ) (50.01%)
258,091,488,691 instructions # 0.58 insn per cycle
# 1.16 stalled cycles per insn ( +- 0.22% ) (66.67%)
70,568,169,256 branches # 673.660 M/sec ( +- 0.17% ) (50.01%)
3,100,725,426 branch-misses # 4.39% of all branches ( +- 0.20% ) (49.99%)

105.583 +- 0.589 seconds time elapsed ( +- 0.56% )

Patched:
Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):

104,055.69 msec task-clock # 0.993 CPUs utilized ( +- 0.32% )
4,606 context-switches # 0.044 K/sec ( +- 0.20% )
2,463 cpu-migrations # 0.024 K/sec ( +- 0.93% )
394 page-faults # 0.004 K/sec ( +- 0.25% )
442,951,129,921 cycles # 4.257 GHz ( +- 0.32% ) (66.66%)
8,710,413,329 stalled-cycles-frontend # 1.97% frontend cycles idle ( +- 0.47% ) (50.06%)
299,656,905,836 stalled-cycles-backend # 67.65% backend cycles idle ( +- 0.39% ) (50.02%)
252,731,168,193 instructions # 0.57 insn per cycle
# 1.19 stalled cycles per insn ( +- 0.20% ) (66.66%)
68,902,851,121 branches # 662.173 M/sec ( +- 0.13% ) (49.94%)
3,100,242,882 branch-misses # 4.50% of all branches ( +- 0.15% ) (49.98%)

104.829 +- 0.325 seconds time elapsed ( +- 0.31% )

This is consistent. An add-by-count hot-add operation adds LMBs
greedily, so LMBs near the start of the drconf range are considered
first. On an otherwise idle LPAR with so many LMBs we would expect to
find the LMBs we need near the start of the drconf range, hence the
smaller speedup.

Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200916145122.3408129-1-cheloha@linux.ibm.com


# fa35e868 09-Aug-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Implement cpu_to_coregroup_id

Lookup the coregroup id from the associativity array.

If unable to detect the coregroup id, fallback on the core id.
This way, ensure sched_domain degenerates and an extra sched domain is
not created.

Ideally this function should have been implemented in
arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we
don't need to find the primary domain again.

If the device-tree mentions more than one coregroup, then kernel
implements only the last or the smallest coregroup, which currently
corresponds to the penultimate domain in the device-tree.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-11-srikar@linux.vnet.ibm.com


# 72730bfc 09-Aug-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/smp: Create coregroup domain

Add percpu coregroup maps and masks to create coregroup domain.
If a coregroup doesn't exist, the coregroup domain will be degenerated
in favour of SMT/CACHE domain. Do note this patch is only creating stubs
for cpu_to_coregroup_id. The actual cpu_to_coregroup_id implementation
would be in a subsequent patch.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-10-srikar@linux.vnet.ibm.com


# f9f130ff 09-Aug-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Detect support for coregroup

Add support for grouping cores based on the device-tree classification.
- The last domain in the associativity domains always refers to the
core.
- If primary reference domain happens to be the penultimate domain in
the associativity domains device-tree property, then there are no
coregroups. However if its not a penultimate domain, then there are
coregroups. There can be more than one coregroup. For now we would be
interested in the last or the smallest coregroups, i.e one sub-group
per DIE.

Currently there are no firmwares that are exposing this grouping. Hence
allow the basis for grouping to be abstract. Once the firmware starts
using this grouping, code would be added to detect the type of grouping
and adjust the sd domain flags accordingly.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-8-srikar@linux.vnet.ibm.com


# e75130f2 18-Aug-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Offline memoryless cpuless node 0

Currently Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes, marks node 0 as online at boot. However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy node which is cpuless and
memoryless node can confuse users/scripts looking at output of lscpu /
numactl.

By marking, node 0 as offline, lets stop assuming that node 0 is
always online. If node 0 has CPU or memory that are online, node 0 will
again be set as online.

v5.8
available: 2 nodes (0,2)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31490 MB
node distances:
node 0 2
0: 10 20
2: 20 10

proc and sys files
------------------
/sys/devices/system/node/online: 0,2
/proc/sys/kernel/numa_balancing: 1
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31

v5.8 + patch
------------------
available: 1 nodes (2)
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31487 MB
node distances:
node 2
2: 10

proc and sys files
------------------
/sys/devices/system/node/online: 2
/proc/sys/kernel/numa_balancing: 0
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31

Example of a node with online CPUs/memory on node 0.
(Same o/p with and without patch)
numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 32482 MB
node 0 free: 22994 MB
node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node 0 1 2 3
0: 10 20 40 40
1: 20 10 40 40
2: 40 40 10 20
3: 40 40 20 10

Note: On Powerpc, cpu_to_node of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without the 2 commits, Powerpc system might crash.

1. User space applications like Numactl, lscpu, that parse the sysfs tend to
believe there is an extra online node. This tends to confuse users and
applications. Other user space applications start believing that system was
not able to use all the resources (i.e missing resources) or the system was
not setup correctly.

2. Also existence of dummy node also leads to inconsistent information. The
number of online nodes is inconsistent with the information in the
device-tree and resource-dump

3. When the dummy node is present, single node non-Numa systems end up showing
up as NUMA systems and numa_balancing gets enabled. This will mean we take
the hit from the unnecessary numa hinting faults.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200818081104.57888-4-srikar@linux.vnet.ibm.com


# 6398eaa2 18-Aug-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Prefer node id queried from vphn

Node id queried from the static device tree may not
be correct. For example: it may always show 0 on a shared processor.
Hence prefer the node id queried from vphn and fallback on the device tree
based node id if vphn query fails.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200818081104.57888-3-srikar@linux.vnet.ibm.com


# a874f100 18-Aug-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Set numa_node for all possible cpus

A Powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 does not any cpus
or memory attached to it. As per PAPR, node affinity of a cpu is only
available once its present / online. For all cpus that are possible but
not present, cpu_to_node() would point to node 0.

To ensure a cpuless, memoryless dummy node is not online, powerpc need
to make sure all possible but not present cpu_to_node are set to a
proper node.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200818081104.57888-2-srikar@linux.vnet.ibm.com


# 67df7784 16-Aug-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Restrict possible nodes based on platform

As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
Abstraction Services (RTAS) Node" available at:
https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf

... there are 2 device tree properties:

"ibm,max-associativity-domains"
which defines the maximum number of domains that the firmware i.e
PowerVM can support.

and:

"ibm,current-associativity-domains"
which defines the maximum number of domains that the current
platform can support.

The value of "ibm,max-associativity-domains" is always greater than or
equal to "ibm,current-associativity-domains" property. If the latter
property is not available, use "ibm,max-associativity-domain" as a
fallback. In this yet to be released LoPAPR, "ibm,current-associativity-domains"
is mentioned in page 833 / B.5.3 which is covered under under
"Appendix B. System Binding" section

Currently powerpc uses the "ibm,max-associativity-domains" property
while setting the possible number of nodes. This is currently set at
32. However the possible number of nodes for a platform may be
significantly less. Hence set the possible number of nodes based on
"ibm,current-associativity-domains" property.

Nathan Lynch had raised a valid concern that post LPM (Live Partition
Migration), a user could DLPAR add processors and memory after LPM
with "new" associativity properties:
https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u

He also pointed out that "ibm,max-associativity-domains" has the same
contents on all currently available PowerVM systems, unlike
"ibm,current-associativity-domains" and hence may be better able to
handle the new NUMA associativity properties.

However with the recent commit dbce45628085 ("powerpc/numa: Limit
possible nodes to within num_possible_nodes"), all new NUMA
associativity properties are capped to initially set nr_node_ids.
Hence this commit should be safe with any new DLPAR add post LPM.

$ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
/proc/device-tree/rtas/ibm,current-associativity-domains
00000005 00000001 00000002 00000002 00000002 00000010
/proc/device-tree/rtas/ibm,max-associativity-domains
00000005 00000001 00000008 00000020 00000020 00000100

$ cat /sys/devices/system/node/possible ##Before patch
0-31

$ cat /sys/devices/system/node/possible ##After patch
0-1

Note the maximum nodes this platform can support is only 2 but the
possible nodes is set to 32.

This is important because lot of kernel and user space code allocate
structures for all possible nodes leading to a lot of memory that is
allocated but not used.

I ran a simple experiment to create and destroy 100 memory cgroups on
boot on a 8 node machine (Power8 Alpine).

Before patch:
free -k at boot
total used free shared buff/cache available
Mem: 523498176 4106816 518820608 22272 570752 516606720
Swap: 4194240 0 4194240

free -k after creating 100 memory cgroups
total used free shared buff/cache available
Mem: 523498176 4628416 518246464 22336 623296 516058688
Swap: 4194240 0 4194240

free -k after destroying 100 memory cgroups
total used free shared buff/cache available
Mem: 523498176 4697408 518173760 22400 627008 515987904
Swap: 4194240 0 4194240

After patch:
free -k at boot
total used free shared buff/cache available
Mem: 523498176 3969472 518933888 22272 594816 516731776
Swap: 4194240 0 4194240

free -k after creating 100 memory cgroups
total used free shared buff/cache available
Mem: 523498176 4181888 518676096 22208 640192 516496448
Swap: 4194240 0 4194240

free -k after destroying 100 memory cgroups
total used free shared buff/cache available
Mem: 523498176 4232320 518619904 22272 645952 516443264
Swap: 4194240 0 4194240

Observations:
Fixed kernel takes 137344 kb (4106816-3969472) less to boot.
Fixed kernel takes 309184 kb (4628416-4181888-137344) less to create 100 memcgs.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[mpe: Reformat change log a bit for readability]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200817055257.110873-1-srikar@linux.vnet.ibm.com


# c89ab04f 07-Aug-2020 Mike Rapoport <rppt@kernel.org>

mm/sparse: cleanup the code surrounding memory_present()

After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent
functions that call memory_present() for each region in memblock.memory:
sparse_memory_present_with_active_regions() and membocks_present().

Moreover, all architectures have a call to either of these functions
preceding the call to sparse_init() and in the most cases they are called
one after the other.

Mark the regions from memblock.memory as present during sparce_init() by
making sparse_init() call memblocks_present(), make memblocks_present()
and memory_present() functions static and remove redundant
sparse_memory_present_with_active_regions() function.

Also remove no longer required HAVE_MEMORY_PRESENT configuration option.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# adfefc60 29-Jul-2020 Hari Bathini <hbathini@linux.ibm.com>

powerpc/drmem: Make LMB walk a bit more flexible

Currently, numa & prom are the only users of drmem LMB walk code.
Loading kdump with kexec_file also needs to walk the drmem LMBs to
setup the usable memory ranges for kdump kernel. But there are couple
of issues in using the code as is. One, walk_drmem_lmb() code is built
into the .init section currently, while kexec_file needs it later.
Two, there is no scope to pass data to the callback function for
processing and/or erroring out on certain conditions.

Fix that by, moving drmem LMB walk code out of .init section, adding
scope to pass data to the callback function and bailing out when an
error is encountered in the callback function.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: Pingfan Liu <piliu@redhat.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159602282727.575379.3979857013827701828.stgit@hbathini


# dbce4562 24-Jul-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Limit possible nodes to within num_possible_nodes

MAX_NUMNODES is a theoretical maximum number of nodes thats is
supported by the kernel. Device tree properties exposes the number of
possible nodes on the current platform. The kernel would detected this
and would use it for most of its resource allocations. If the platform
now increases the nodes to over what was already exposed, then it may
lead to inconsistencies. Hence limit it to the already exposed nodes.

Suggested-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724105809.24733-1-srikar@linux.vnet.ibm.com


# cdf082c4 11-Jun-2020 Nathan Lynch <nathanl@linux.ibm.com>

powerpc/numa: remove arch_update_cpu_topology

Since arch_update_cpu_topology() doesn't do anything on powerpc now,
remove it and associated dead code.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200612051238.1007764-15-nathanl@linux.ibm.com


# 042ef7cc 11-Jun-2020 Nathan Lynch <nathanl@linux.ibm.com>

powerpc/numa: remove prrn_is_enabled()

All users of this prrn_is_enabled() are gone; remove it.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200612051238.1007764-14-nathanl@linux.ibm.com


# 1835303e 11-Jun-2020 Nathan Lynch <nathanl@linux.ibm.com>

powerpc/numa: remove start/stop_topology_update()

These APIs have become no-ops, so remove them and all call sites.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200612051238.1007764-12-nathanl@linux.ibm.com


# b1815aea 11-Jun-2020 Nathan Lynch <nathanl@linux.ibm.com>

powerpc/numa: remove timed_topology_update()

timed_topology_update is a no-op now, so remove it and all call sites.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200612051238.1007764-11-nathanl@linux.ibm.com


# 893ec646 11-Jun-2020 Nathan Lynch <nathanl@linux.ibm.com>

powerpc/numa: stub out numa_update_cpu_topology()

Previous changes have removed the code which sets bits in
cpu_associativity_changes_mask and thus it is never modifed at
runtime. From this we can reason that numa_update_cpu_topology()
always returns 0 without doing anything. Remove the body of
numa_update_cpu_topology() and remove all code which becomes
unreachable as a result.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200612051238.1007764-10-nathanl@linux.ibm.com


# 9fb8b5fd 11-Jun-2020 Nathan Lynch <nathanl@linux.ibm.com>

powerpc/numa: remove vphn_enabled and prrn_enabled internal flags

These flags are always zero now; remove them and suitably adjust the
remaining references to them.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200612051238.1007764-9-nathanl@linux.ibm.com


# 6325cb4a 11-Jun-2020 Nathan Lynch <nathanl@linux.ibm.com>

powerpc/numa: remove unreachable topology workqueue code

Since vphn_enabled is always 0, we can remove the call to
topology_schedule_update() and remove the code which becomes
unreachable as a result.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200612051238.1007764-8-nathanl@linux.ibm.com


# 50e0cf37 11-Jun-2020 Nathan Lynch <nathanl@linux.ibm.com>

powerpc/numa: remove unreachable topology timer code

Since vphn_enabled is always 0, we can stub out
timed_topology_update() and remove the code which becomes unreachable.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200612051238.1007764-7-nathanl@linux.ibm.com


# e6eacf8e 11-Jun-2020 Nathan Lynch <nathanl@linux.ibm.com>

powerpc/numa: make vphn_enabled, prrn_enabled flags const

Previous changes have made it so these flags are never changed;
enforce this by making them const.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200612051238.1007764-6-nathanl@linux.ibm.com


# 7d35bef9 11-Jun-2020 Nathan Lynch <nathanl@linux.ibm.com>

powerpc/numa: remove unreachable topology update code

Since the topology_updates_enabled flag is now always false, remove it
and the code which has become unreachable. This is the minimum change
that prevents 'defined but unused' warnings emitted by the compiler
after stubbing out the start/stop_topology_updates() functions.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200612051238.1007764-5-nathanl@linux.ibm.com


# c30f931e 11-Jun-2020 Nathan Lynch <nathanl@linux.ibm.com>

powerpc/numa: remove ability to enable topology updates

Remove the /proc/powerpc/topology_updates interface and the
topology_updates=on/off command line argument. The internal
topology_updates_enabled flag remains for now, but always false.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200612051238.1007764-4-nathanl@linux.ibm.com


# 247257b0 29-Jan-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Remove late request for home node associativity

With commit ("powerpc/numa: Early request for home node associativity"),
commit 2ea626306810 ("powerpc/topology: Get topology for shared
processors at boot") which was requesting home node associativity
becomes redundant.

Hence remove the late request for home node associativity.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200129135301.24739-6-srikar@linux.vnet.ibm.com


# dc909d8b 29-Jan-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Early request for home node associativity

Currently the kernel detects if its running on a shared lpar platform
and requests home node associativity before the scheduler sched_domains
are setup. However between the time NUMA setup is initialized and the
request for home node associativity, workqueue initializes its per node
cpumask. The per node workqueue possible cpumask may turn invalid
after home node associativity resulting in weird situations like
workqueue possible cpumask being a subset of workqueue online cpumask.

This can be fixed by requesting home node associativity earlier just
before NUMA setup. However at the NUMA setup time, kernel may not be in
a position to detect if its running on a shared lpar platform. So
request for home node associativity and if the request fails, fallback
on the device tree property.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200129135301.24739-5-srikar@linux.vnet.ibm.com


# 413e4055 29-Jan-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Use cpu node map of first sibling thread

All the sibling threads of a core have to be part of the same node.
To ensure that all the sibling threads map to the same node, always
lookup/update the cpu-to-node map of the first thread in the core.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200129135301.24739-4-srikar@linux.vnet.ibm.com


# 76b7bfb1 29-Jan-2020 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Handle extra hcall_vphn error cases

Currently code handles H_FUNCTION, H_SUCCESS, H_HARDWARE return codes.
However hcall_vphn can return other return codes. Now it also handles
H_PARAMETER return code. Also the rest return codes are handled under the
default case.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200129135301.24739-3-srikar@linux.vnet.ibm.com


# 97a32539 03-Feb-2020 Alexey Dobriyan <adobriyan@gmail.com>

proc: convert everything to "struct proc_ops"

The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
seq_file.h.

Conversion rule is:

llseek => proc_lseek
unlocked_ioctl => proc_ioctl

xxx => proc_xxx

delete ".owner = THIS_MODULE" line

[akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
[sfr@canb.auug.org.au: fix kernel/sched/psi.c]
Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 495c2ff4 01-Jul-2019 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/mm: Consolidate numa_enable check and min_common_depth check

If we fail to parse min_common_depth from device tree we boot with
numa disabled. Reflect the same by updating numa_enabled variable
to false. Also, switch all min_common_depth failure check to
if (!numa_enabled) check.

This helps us to avoid checking for both in different code paths.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# f52741c4 01-Jul-2019 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/mm: Fix node look up with numa=off boot

If we boot with numa=off, we need to make sure we return NUMA_NO_NODE when
looking up associativity details of resources. Without this, we hit crash
like below

BUG: Unable to handle kernel data access at 0x40000000008
Faulting instruction address: 0xc000000008f31704
cpu 0x1b: Vector: 380 (Data SLB Access) at [c00000000b9bb320]
pc: c000000008f31704: _raw_spin_lock+0x14/0x100
lr: c0000000083f41fc: ____cache_alloc_node+0x5c/0x290
sp: c00000000b9bb5b0
msr: 800000010280b033
dar: 40000000008
current = 0xc00000000b9a2700
paca = 0xc00000000a740c00 irqmask: 0x03 irq_happened: 0x01
pid = 1, comm = swapper/27
Linux version 5.2.0-rc4-00925-g74e188c620b1 (root@linux-d8ip) (gcc version 7.4.1 20190424 [gcc-7-branch revision 270538] (SUSE Linux)) #34 SMP Sat Jun 29 00:41:02 EDT 2019
enter ? for help
[link register ] c0000000083f41fc ____cache_alloc_node+0x5c/0x290
[c00000000b9bb5b0] 0000000000000dc0 (unreliable)
[c00000000b9bb5f0] c0000000083f48c8 kmem_cache_alloc_node_trace+0x138/0x360
[c00000000b9bb670] c000000008aa789c devres_alloc_node+0x4c/0xa0
[c00000000b9bb6a0] c000000008337218 devm_memremap+0x58/0x130
[c00000000b9bb6f0] c000000008aed00c devm_nsio_enable+0xdc/0x170
[c00000000b9bb780] c000000008af3b6c nd_pmem_probe+0x4c/0x180
[c00000000b9bb7b0] c000000008ad84cc nvdimm_bus_probe+0xac/0x260
[c00000000b9bb840] c000000008aa0628 really_probe+0x148/0x500
[c00000000b9bb8d0] c000000008aa0d7c driver_probe_device+0x19c/0x1d0
[c00000000b9bb950] c000000008aa11bc device_driver_attach+0xcc/0x100
[c00000000b9bb990] c000000008aa12ec __driver_attach+0xfc/0x1e0
[c00000000b9bba10] c000000008a9d0a4 bus_for_each_dev+0xb4/0x130
[c00000000b9bba70] c000000008a9fc04 driver_attach+0x34/0x50
[c00000000b9bba90] c000000008a9f118 bus_add_driver+0x1d8/0x300
[c00000000b9bbb20] c000000008aa2358 driver_register+0x98/0x1a0
[c00000000b9bbb90] c000000008ad7e6c __nd_driver_register+0x5c/0x100
[c00000000b9bbbf0] c0000000093efbac nd_pmem_driver_init+0x34/0x48
[c00000000b9bbc10] c0000000080106c0 do_one_initcall+0x60/0x2d0
[c00000000b9bbce0] c00000000938463c kernel_init_freeable+0x384/0x48c
[c00000000b9bbdb0] c000000008010a5c kernel_init+0x2c/0x160
[c00000000b9bbe20] c00000000800ba54 ret_from_kernel_thread+0x5c/0x68

Reported-and-debugged-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# ea9f5b70 01-Jul-2019 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/mm/drconf: Use NUMA_NO_NODE on failures instead of node 0

If we fail to parse the associativity array we should default to
NUMA_NO_NODE instead of NODE 0. Rest of the code fallback to the
right default if we find the numa node value NUMA_NO_NODE.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# d62c8dee 03-Jul-2019 Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>

powerpc/pseries: Provide vcpu dispatch statistics

For Shared Processor LPARs, the POWER Hypervisor maintains a
relatively static mapping of the LPAR processors (vcpus) to physical
processor chips (representing the "home" node) and tries to always
dispatch vcpus on their associated physical processor chip. However,
under certain scenarios, vcpus may be dispatched on a different
processor chip (away from its home node). The actual physical
processor number on which a certain vcpu is dispatched is available to
the guest in the 'processor_id' field of each DTL entry.

The guest can discover the home node of each vcpu through the
H_HOME_NODE_ASSOCIATIVITY(flags=1) hcall. The guest can also discover
the associativity of physical processors, as represented in the DTL
entry, through the H_HOME_NODE_ASSOCIATIVITY(flags=2) hcall.

These can then be compared to determine if the vcpu was dispatched on
its home node or not. If the vcpu was not dispatched on the home node,
it is possible to determine if the vcpu was dispatched in a different
chip, socket or drawer.

Introduce a procfs file /proc/powerpc/vcpudispatch_stats that can be
used to obtain these statistics. Writing '1' to this file enables
collecting the statistics, while writing '0' disables the statistics.
The statistics themselves are available by reading the procfs file. By
default, the DTLB log for each vcpu is processed 50 times a second so
as not to miss any entries. This processing frequency can be changed
through /proc/powerpc/vcpudispatch_stats_freq.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 5a1ea477 03-Jul-2019 Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>

powerpc/pseries: Move mm/book3s64/vphn.c under platforms/pseries/

hcall_vphn() is specific to pseries and will be used in a subsequent
patch. So, move it to a more appropriate place under
arch/powerpc/platforms/pseries. Also merge vphn.h into lppaca.h
and update vphn selftest to use the new files.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# ef34e0ef 03-Jul-2019 Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>

powerpc/pseries: Generalize hcall_vphn()

H_HOME_NODE_ASSOCIATIVITY hcall can take two different flags and return
different associativity information in each case. Generalize the
existing hcall_vphn() function to take flags as an argument and to
return the result. Update the only existing user to pass the proper
arguments.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 2874c5fd 27-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 47d99948 29-Mar-2019 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc/mm: Move book3s64 specifics in subdirectory mm/book3s64

Many files in arch/powerpc/mm are only for book3S64. This patch
creates a subdirectory for them.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Update the selftest sym links, shorten new filenames, cleanup some
whitespace and formatting in the new files.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 558f8649 18-Apr-2019 Nathan Lynch <nathanl@linux.ibm.com>

powerpc/numa: document topology_updates_enabled, disable by default

Changing the NUMA associations for CPUs and memory at runtime is
basically unsupported by the core mm, scheduler etc. We see all manner
of crashes, warnings and instability when the pseries code tries to do
this. Disable this behavior by default, and document the switch a bit.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 2d4d9b30 18-Apr-2019 Nathan Lynch <nathanl@linux.ibm.com>

powerpc/numa: improve control of topology updates

When booted with "topology_updates=no", or when "off" is written to
/proc/powerpc/topology_updates, NUMA reassignments are inhibited for
PRRN and VPHN events. However, migration and suspend unconditionally
re-enable reassignments via start_topology_update(). This is
incoherent.

Check the topology_updates_enabled flag in
start/stop_topology_update() so that callers of those APIs need not be
aware of whether reassignments are enabled. This allows the
administrative decision on reassignments to remain in force across
migrations and suspensions.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 6917735e 23-Mar-2019 Jagadeesh Pagadala <jagdsh.linux@gmail.com>

powerpc: Remove duplicate headers

Remove duplicate headers inclusions.

Signed-off-by: Jagadeesh Pagadala <jagdsh.linux@gmail.com>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 33755574 12-Mar-2019 Mike Rapoport <rppt@kernel.org>

memblock: memblock_phys_alloc_try_nid(): don't panic

The memblock_phys_alloc_try_nid() function tries to allocate memory from
the requested node and then falls back to allocation from any node in
the system. The memblock_alloc_base() fallback used by this function
panics if the allocation fails.

Replace the memblock_alloc_base() fallback with the direct call to
memblock_alloc_range_nid() and update the memblock_phys_alloc_try_nid()
callers to check the returned value and panic in case of error.

Link: http://lkml.kernel.org/r/1548057848-15136-7-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com> [c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com> [Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b9726c26 05-Mar-2019 Alexey Dobriyan <adobriyan@gmail.com>

numa: make "nr_node_ids" unsigned int

Number of NUMA nodes can't be negative.

This saves a few bytes on x86_64:

add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
Function old new delta
hv_synic_alloc.cold 88 110 +22
prealloc_shrinker 260 262 +2
bootstrap 249 251 +2
sched_init_numa 1566 1567 +1
show_slab_objects 778 777 -1
s_show 1201 1200 -1
kmem_cache_init 346 345 -1
__alloc_workqueue_key 1146 1145 -1
mem_cgroup_css_alloc 1614 1612 -2
__do_sys_swapon 4702 4699 -3
__list_lru_init 655 651 -4
nic_probe 2379 2374 -5
store_user_store 118 111 -7
red_zone_store 106 99 -7
poison_store 106 99 -7
wq_numa_init 348 338 -10
__kmem_cache_empty 75 65 -10
task_numa_free 186 173 -13
merge_across_nodes_store 351 336 -15
irq_create_affinity_masks 1261 1246 -15
do_numa_crng_init 343 321 -22
task_numa_fault 4760 4737 -23
swapfile_init 179 156 -23
hv_synic_alloc 536 492 -44
apply_wqattrs_prepare 746 695 -51

Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 98fa15f3 05-Mar-2019 Anshuman Khandual <anshuman.khandual@arm.com>

mm: replace all open encodings for NUMA_NO_NODE

Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

All these places for replacement were found by running the following
grep patterns on the entire kernel code. Please let me know if this
might have missed some instances. This might also have replaced some
false positives. I will appreciate suggestions, inputs and review.

1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"

This patch (of 2):

At present there are multiple places where invalid node number is
encoded as -1. Even though implicitly understood it is always better to
have macros in there. Replace these open encodings for an invalid node
number with the global macro NUMA_NO_NODE. This helps remove NUMA
related assumptions like 'invalid node' from various places redirecting
them to a common definition.

Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe]
Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx]
Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband]
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 81b61324 29-Oct-2018 Nathan Fontenot <nfont@linux.vnet.ibm.com>

powerpc/pseries: Perform full re-add of CPU for topology update post-migration

On pseries systems, performing a partition migration can result in
altering the nodes a CPU is assigned to on the destination system. For
exampl, pre-migration on the source system CPUs are in node 1 and 3,
post-migration on the destination system CPUs are in nodes 2 and 3.

Handling the node change for a CPU can cause corruption in the slab
cache if we hit a timing where a CPUs node is changed while cache_reap()
is invoked. The corruption occurs because the slab cache code appears
to rely on the CPU and slab cache pages being on the same node.

The current dynamic updating of a CPUs node done in arch/powerpc/mm/numa.c
does not prevent us from hitting this scenario.

Changing the device tree property update notification handler that
recognizes an affinity change for a CPU to do a full DLPAR remove and
add of the CPU instead of dynamically changing its node resolves this
issue.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael W. Bringmann <mwb@linux.vnet.ibm.com>
Tested-by: Michael W. Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# e5480bdc 16-Nov-2018 Rob Herring <robh@kernel.org>

powerpc: Use device_type helpers to access the node type

Remove directly accessing device_node.type pointer and use the
accessors instead. This will eventually allow removing the type
pointer.

Replace the open coded iterating over child nodes with
for_each_child_of_node() while we're here.

Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 437ccdc8 07-Nov-2018 Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>

powerpc/numa: Suppress "VPHN is not supported" messages

When VPHN function is not supported and during cpu hotplug event,
kernel prints message 'VPHN function not supported. Disabling
polling...'. Currently it prints on every hotplug event, it floods
dmesg when a KVM guest tries to hotplug huge number of vcpus, let's
just print once and suppress further kernel prints.

Signed-off-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 57c8a661 30-Oct-2018 Mike Rapoport <rppt@linux.vnet.ibm.com>

mm: remove include/linux/bootmem.h

Move remaining definitions and declarations from include/linux/bootmem.h
into include/linux/memblock.h and remove the redundant header.

The includes were replaced with the semantic patch below and then
semi-automated removal of duplicated '#include <linux/memblock.h>

@@
@@
- #include <linux/bootmem.h>
+ #include <linux/memblock.h>

[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
[sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
[sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9a8dd708 30-Oct-2018 Mike Rapoport <rppt@linux.vnet.ibm.com>

memblock: rename memblock_alloc{_nid,_try_nid} to memblock_phys_alloc*

Make it explicit that the caller gets a physical address rather than a
virtual one.

This will also allow using meblock_alloc prefix for memblock allocations
returning virtual address, which is done in the following patches.

The conversion is done using the following semantic patch:

@@
expression e1, e2, e3;
@@
(
- memblock_alloc(e1, e2)
+ memblock_phys_alloc(e1, e2)
|
- memblock_alloc_nid(e1, e2, e3)
+ memblock_phys_alloc_nid(e1, e2, e3)
|
- memblock_alloc_try_nid(e1, e2, e3)
+ memblock_phys_alloc_try_nid(e1, e2, e3)
)

Link: http://lkml.kernel.org/r/1536927045-23536-7-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 65b9fdad 09-Oct-2018 Michael Bringmann <mwb@linux.vnet.ibm.com>

powerpc/pseries/mobility: Extend start/stop topology update scope

The powerpc mobility code may receive RTAS requests to perform PRRN
(Platform Resource Reassignment Notification) topology changes at any
time, including during LPAR migration operations.

In some configurations where the affinity of CPUs or memory is being
changed on that platform, the PRRN requests may apply or refer to
outdated information prior to the complete update of the device-tree.

This patch changes the duration for which topology updates are
suppressed during LPAR migrations from just the rtas_ibm_suspend_me()
/ 'ibm,suspend-me' call(s) to cover the entire migration_store()
operation to allow all changes to the device-tree to be applied prior
to accepting and applying any PRRN requests.

For tracking purposes, pr_info notices are added to the functions
start_topology_update() and stop_topology_update() of 'numa.c'.

Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# ac1788cc 27-Sep-2018 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Skip onlining a offline node in kdump path

With commit 2ea626306810 ("powerpc/topology: Get topology for shared
processors at boot"), kdump kernel on shared LPAR may crash.

The necessary conditions are
- Shared LPAR with at least 2 nodes having memory and CPUs.
- Memory requirement for kdump kernel must be met by the first N-1
nodes where there are at least N nodes with memory and CPUs.

Example numactl of such a machine.
$ numactl -H
available: 5 nodes (0,2,5-7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus:
node 2 size: 255 MB
node 2 free: 189 MB
node 5 cpus: 24 25 26 27 28 29 30 31
node 5 size: 4095 MB
node 5 free: 4024 MB
node 6 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 6 size: 6353 MB
node 6 free: 5998 MB
node 7 cpus: 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39
node 7 size: 7640 MB
node 7 free: 7164 MB
node distances:
node 0 2 5 6 7
0: 10 40 40 40 40
2: 40 10 40 40 40
5: 40 40 10 40 40
6: 40 40 40 10 20
7: 40 40 40 20 10

Steps to reproduce.
1. Load / start kdump service.
2. Trigger a kdump (for example : echo c > /proc/sysrq-trigger)

When booting a kdump kernel with 2048M:

kexec: Starting switchover sequence.
I'm in purgatory
Using 1TB segments
hash-mmu: Initializing hash mmu with SLB
Linux version 4.19.0-rc5-master+ (srikar@linux-xxu6) (gcc version 4.8.5 (SUSE Linux)) #1 SMP Thu Sep 27 19:45:00 IST 2018
Found initrd at 0xc000000009e70000:0xc00000000ae554b4
Using pSeries machine description
-----------------------------------------------------
ppc64_pft_size = 0x1e
phys_mem_size = 0x88000000
dcache_bsize = 0x80
icache_bsize = 0x80
cpu_features = 0x000000ff8f5d91a7
possible = 0x0000fbffcf5fb1a7
always = 0x0000006f8b5c91a1
cpu_user_features = 0xdc0065c2 0xef000000
mmu_features = 0x7c006001
firmware_features = 0x00000007c45bfc57
htab_hash_mask = 0x7fffff
physical_start = 0x8000000
-----------------------------------------------------
numa: NODE_DATA [mem 0x87d5e300-0x87d67fff]
numa: NODE_DATA(0) on node 6
numa: NODE_DATA [mem 0x87d54600-0x87d5e2ff]
Top of RAM: 0x88000000, Total RAM: 0x88000000
Memory hole size: 0MB
Zone ranges:
DMA [mem 0x0000000000000000-0x0000000087ffffff]
DMA32 empty
Normal empty
Movable zone start for each node
Early memory node ranges
node 6: [mem 0x0000000000000000-0x0000000087ffffff]
Could not find start_pfn for node 0
Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
On node 0 totalpages: 0
Initmem setup node 6 [mem 0x0000000000000000-0x0000000087ffffff]
On node 6 totalpages: 34816

Unable to handle kernel paging request for data at address 0x00000060
Faulting instruction address: 0xc000000008703a54
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 11 PID: 1 Comm: swapper/11 Not tainted 4.19.0-rc5-master+ #1
NIP: c000000008703a54 LR: c000000008703a38 CTR: 0000000000000000
REGS: c00000000b673440 TRAP: 0380 Not tainted (4.19.0-rc5-master+)
MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 24022022 XER: 20000002
CFAR: c0000000086fc238 IRQMASK: 0
GPR00: c000000008703a38 c00000000b6736c0 c000000009281900 0000000000000000
GPR04: 0000000000000000 0000000000000000 fffffffffffff001 c00000000b660080
GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000220
GPR12: 0000000000002200 c000000009e51400 0000000000000000 0000000000000008
GPR16: 0000000000000000 c000000008c152e8 c000000008c152a8 0000000000000000
GPR20: c000000009422fd8 c000000009412fd8 c000000009426040 0000000000000008
GPR24: 0000000000000000 0000000000000000 c000000009168bc8 c000000009168c78
GPR28: c00000000b126410 0000000000000000 c00000000916a0b8 c00000000b126400
NIP [c000000008703a54] bus_add_device+0x84/0x1e0
LR [c000000008703a38] bus_add_device+0x68/0x1e0
Call Trace:
[c00000000b6736c0] [c000000008703a38] bus_add_device+0x68/0x1e0 (unreliable)
[c00000000b673740] [c000000008700194] device_add+0x454/0x7c0
[c00000000b673800] [c00000000872e660] __register_one_node+0xb0/0x240
[c00000000b673860] [c00000000839a6bc] __try_online_node+0x12c/0x180
[c00000000b673900] [c00000000839b978] try_online_node+0x58/0x90
[c00000000b673930] [c0000000080846d8] find_and_online_cpu_nid+0x158/0x190
[c00000000b673a10] [c0000000080848a0] numa_update_cpu_topology+0x190/0x580
[c00000000b673c00] [c000000008d3f2e4] smp_cpus_done+0x94/0x108
[c00000000b673c70] [c000000008d5c00c] smp_init+0x174/0x19c
[c00000000b673d00] [c000000008d346b8] kernel_init_freeable+0x1e0/0x450
[c00000000b673dc0] [c0000000080102e8] kernel_init+0x28/0x160
[c00000000b673e30] [c00000000800b65c] ret_from_kernel_thread+0x5c/0x80
Instruction dump:
60000000 60000000 e89e0020 7fe3fb78 4bff87d5 60000000 7c7d1b79 4082008c
e8bf0050 e93e0098 3b9f0010 2fa50000 <e8690060> 38630018 419e0114 7f84e378
---[ end trace 593577668c2daa65 ]---

However a regular kernel with 4096M (2048 gets reserved for crash
kernel) boots properly.

Unlike regular kernels, which mark all available nodes as online,
kdump kernel only marks just enough nodes as online and marks the rest
as offline at boot. However kdump kernel boots with all available
CPUs. With Commit 2ea626306810 ("powerpc/topology: Get topology for
shared processors at boot"), all CPUs are onlined on their respective
nodes at boot time. try_online_node() tries to online the offline
nodes but fails as all needed subsystems are not yet initialized.

As part of fix, detect and skip early onlining of a offline node.

Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot")
Reported-by: Pavithra Prakash <pavrampu@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 2483ef05 25-Sep-2018 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/numa: Use associativity if VPHN hcall is successful

Currently associativity is used to lookup node-id even if the
preceding VPHN hcall failed. However this can cause CPU to be made
part of the wrong node, (most likely to be node 0). This is because
VPHN is not enabled on KVM guests.

With 2ea6263 ("powerpc/topology: Get topology for shared processors at
boot"), associativity is used to set to the wrong node. Hence KVM
guest topology is broken.

For example : A 4 node KVM guest before would have reported.

[root@localhost ~]# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 1746 MB
node 0 free: 1604 MB
node 1 cpus: 4 5 6 7
node 1 size: 2044 MB
node 1 free: 1765 MB
node 2 cpus: 8 9 10 11
node 2 size: 2044 MB
node 2 free: 1837 MB
node 3 cpus: 12 13 14 15
node 3 size: 2044 MB
node 3 free: 1903 MB
node distances:
node 0 1 2 3
0: 10 40 40 40
1: 40 10 40 40
2: 40 40 10 40
3: 40 40 40 10

Would now report:

[root@localhost ~]# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 1746 MB
node 0 free: 1244 MB
node 1 cpus:
node 1 size: 2044 MB
node 1 free: 2032 MB
node 2 cpus: 1
node 2 size: 2044 MB
node 2 free: 2028 MB
node 3 cpus:
node 3 size: 2044 MB
node 3 free: 2032 MB
node distances:
node 0 1 2 3
0: 10 40 40 40
1: 40 10 40 40
2: 40 40 10 40
3: 40 40 40 10

Fix this by skipping associativity lookup if the VPHN hcall failed.

Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot")
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 8604895a 20-Sep-2018 Michael Bringmann <mwb@linux.vnet.ibm.com>

powerpc/pseries: Fix unitialized timer reset on migration

After migration of a powerpc LPAR, the kernel executes code to
update the system state to reflect new platform characteristics.

Such changes include modifications to device tree properties provided
to the system by PHYP. Property notifications received by the
post_mobility_fixup() code are passed along to the kernel in general
through a call to of_update_property() which in turn passes such
events back to all modules through entries like the '.notifier_call'
function within the NUMA module.

When the NUMA module updates its state, it resets its event timer. If
this occurs after a previous call to stop_topology_update() or on a
system without VPHN enabled, the code runs into an unitialized timer
structure and crashes. This patch adds a safety check along this path
toward the problem code.

An example crash log is as follows.

ibmvscsi 30000081: Re-enabling adapter!
------------[ cut here ]------------
kernel BUG at kernel/time/timer.c:958!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag lockd unix_diag af_packet_diag netlink_diag grace fscache sunrpc xts vmx_crypto pseries_rng sg binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
CPU: 11 PID: 3067 Comm: drmgr Not tainted 4.17.0+ #179
...
NIP mod_timer+0x4c/0x400
LR reset_topology_timer+0x40/0x60
Call Trace:
0xc0000003f9407830 (unreliable)
reset_topology_timer+0x40/0x60
dt_update_callback+0x100/0x120
notifier_call_chain+0x90/0x100
__blocking_notifier_call_chain+0x60/0x90
of_property_notify+0x90/0xd0
of_update_property+0x104/0x150
update_dt_property+0xdc/0x1f0
pseries_devicetree_update+0x2d0/0x510
post_mobility_fixup+0x7c/0xf0
migration_store+0xa4/0xc0
kobj_attr_store+0x30/0x60
sysfs_kf_write+0x64/0xa0
kernfs_fop_write+0x16c/0x240
__vfs_write+0x40/0x200
vfs_write+0xc8/0x240
ksys_write+0x5c/0x100
system_call+0x58/0x6c

Fixes: 5d88aa85c00b ("powerpc/pseries: Update CPU maps when device tree is updated")
Cc: stable@vger.kernel.org # v3.10+
Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 2ea62630 17-Aug-2018 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

powerpc/topology: Get topology for shared processors at boot

On a shared LPAR, Phyp will not update the CPU associativity at boot
time. Just after the boot system does recognize itself as a shared
LPAR and trigger a request for correct CPU associativity. But by then
the scheduler would have already created/destroyed its sched domains.

This causes
- Broken load balance across Nodes causing islands of cores.
- Performance degradation esp if the system is lightly loaded
- dmesg to wrongly report all CPUs to be in Node 0.
- Messages in dmesg saying borken topology.
- With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity
node sched domain"), can cause rcu stalls at boot up.

The sched_domains_numa_masks table which is used to generate cpumasks
is only created at boot time just before creating sched domains and
never updated. Hence, its better to get the topology correct before
the sched domains are created.

For example on 64 core Power 8 shared LPAR, dmesg reports

Brought up 512 CPUs
Node 0 CPUs: 0-511
Node 1 CPUs:
Node 2 CPUs:
Node 3 CPUs:
Node 4 CPUs:
Node 5 CPUs:
Node 6 CPUs:
Node 7 CPUs:
Node 8 CPUs:
Node 9 CPUs:
Node 10 CPUs:
Node 11 CPUs:
...
BUG: arch topology borken
the DIE domain not a subset of the NUMA domain
BUG: arch topology borken
the DIE domain not a subset of the NUMA domain

numactl/lscpu output will still be correct with cores spreading across
all nodes:

Socket(s): 64
NUMA node(s): 12
Model: 2.0 (pvr 004d 0200)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s): 152-159,248-255,344-351,440-447
NUMA node11 CPU(s): 160-167,256-263,352-359,448-455

Currently on this LPAR, the scheduler detects 2 levels of Numa and
created numa sched domains for all CPUs, but it finds a single DIE
domain consisting of all CPUs. Hence it deletes all numa sched
domains.

To address this, detect the shared processor and update topology soon
after CPUs are setup so that correct topology is updated just before
scheduler creates sched domain.

With the fix, dmesg reports:

numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 368-375 464-471
numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 376-383 472-479
numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 384-391 480-487
numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 392-399 488-495
numa: Node 4 CPUs: 208-215 304-311 400-407 496-503
numa: Node 5 CPUs: 168-175 264-271 360-367 456-463
numa: Node 6 CPUs: 128-135 224-231 320-327 416-423
numa: Node 7 CPUs: 136-143 232-239 328-335 424-431
numa: Node 8 CPUs: 216-223 312-319 408-415 504-511
numa: Node 9 CPUs: 144-151 240-247 336-343 432-439
numa: Node 10 CPUs: 152-159 248-255 344-351 440-447
numa: Node 11 CPUs: 160-167 256-263 352-359 448-455

and lscpu also reports:

Socket(s): 64
NUMA node(s): 12
Model: 2.0 (pvr 004d 0200)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s): 152-159,248-255,344-351,440-447
NUMA node11 CPU(s): 160-167,256-263,352-359,448-455

Reported-by: Manjunatha H R <manjuhr1@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[mpe: Trim / format change log]
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 6396bb22 12-Jun-2018 Kees Cook <keescook@chromium.org>

treewide: kzalloc() -> kcalloc()

The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:

kzalloc(a * b, gfp)

with:
kcalloc(a * b, gfp)

as well as handling cases of:

kzalloc(a * b * c, gfp)

with:

kzalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

kzalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

kzalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
kzalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
kzalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kzalloc
+ kcalloc
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
kzalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
kzalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
kzalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
kzalloc(sizeof(THING) * C2, ...)
|
kzalloc(sizeof(TYPE) * C2, ...)
|
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- E1 * E2
+ E1, E2
, ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>


# 9bd9be00 13-Feb-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc/mm/numa: move numa topology discovery earlier

Split sparsemem initialisation from basic numa topology discovery.
Move the parsing earlier in boot, before pacas are allocated.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 499dcd41 13-Feb-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc/64s: Allocate LPPACAs individually

We no longer allocate lppacas in an array, so this patch removes the
1kB static alignment for the structure, and enforces the PAPR
alignment requirements at allocation time. We can not reduce the 1kB
allocation size however, due to existing KVM hypervisors.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 1d9a0907 26-Jan-2018 Nathan Fontenot <nfont@linux.vnet.ibm.com>

powerpc/numa: Invalidate numa_cpu_lookup_table on cpu remove

When DLPAR removing a CPU, the unmapping of the cpu from a node in
unmap_cpu_from_node() should also invalidate the CPUs entry in the
numa_cpu_lookup_table. There is not a guarantee that on a subsequent
DLPAR add of the CPU the associativity will be the same and thus
could be in a different node. Invalidating the entry in the
numa_cpu_lookup_table causes the associativity to be read from the
device tree at the time of the add.

The current behavior of not invalidating the CPUs entry in the
numa_cpu_lookup_table can result in scenarios where the the topology
layout of CPUs in the partition does not match the device tree
or the topology reported by the HMC.

This bug looks like it was introduced in 2004 in the commit titled
"ppc64: cpu hotplug notifier for numa", which is 6b15e4e87e32 in the
linux-fullhist tree. Hence tag it for all stable releases.

Cc: stable@vger.kernel.org
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# e67e02a5 28-Nov-2017 Michael Bringmann <mwb@linux.vnet.ibm.com>

powerpc/pseries: Fix cpu hotplug crash with memoryless nodes

On powerpc systems with shared configurations of CPUs and memory and
memoryless nodes at boot, an event ordering problem was observed on a
SLES12 build platforms with the hot-add of CPUs to the memoryless
nodes.

* The most common error occurred when the memory SLAB driver attempted
to reference the memoryless node to which a CPU was being added
before the kernel had finished initializing all of the data
structures for the CPU and exited 'device_online' under
DLPAR/hot-add.

Normally the memoryless node would be initialized through the call
path device_online ... arch_update_cpu_topology ... find_cpu_nid ...
try_online_node. This patch ensures that the powerpc node will be
initialized as early as possible, even if it was memoryless and
CPU-less at the point when we are trying to hot-add a new CPU to it.

Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# ea05ba7c 28-Nov-2017 Michael Bringmann <mwb@linux.vnet.ibm.com>

powerpc/numa: Ensure nodes initialized for hotplug

This patch fixes some problems encountered at runtime with
configurations that support memory-less nodes, or that hot-add CPUs
into nodes that are memoryless during system execution after boot. The
problems of interest include:

* Nodes known to powerpc to be memoryless at boot, but to have CPUs in
them are allowed to be 'possible' and 'online'. Memory allocations
for those nodes are taken from another node that does have memory
until and if memory is hot-added to the node.

* Nodes which have no resources assigned at boot, but which may still
be referenced subsequently by affinity or associativity attributes,
are kept in the list of 'possible' nodes for powerpc. Hot-add of
memory or CPUs to the system can reference these nodes and bring
them online instead of redirecting the references to one of the set
of nodes known to have memory at boot.

Note that this software operates under the context of CPU hotplug. We
are not doing memory hotplug in this code, but rather updating the
kernel's CPU topology (i.e. arch_update_cpu_topology /
numa_update_cpu_topology). We are initializing a node that may be used
by CPUs or memory before it can be referenced as invalid by a CPU
hotplug operation. CPU hotplug operations are protected by a range of
APIs including cpu_maps_update_begin/cpu_maps_update_done,
cpus_read/write_lock / cpus_read/write_unlock, device locks, and more.
Memory hotplug operations, including try_online_node, are protected by
mem_hotplug_begin/mem_hotplug_done, device locks, and more. In the
case of CPUs being hot-added to a previously memoryless node, the
try_online_node operation occurs wholly within the CPU locks with no
overlap. Using HMC hot-add/hot-remove operations, we have been able to
add and remove CPUs to any possible node without failures. HMC
operations involve a degree self-serialization, though.

Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# a346137e 28-Nov-2017 Michael Bringmann <mwb@linux.vnet.ibm.com>

powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes

On powerpc systems which allow 'hot-add' of CPU or memory resources,
it may occur that the new resources are to be inserted into nodes that
were not used for these resources at bootup. In the kernel, any node
that is used must be defined and initialized. These empty nodes may
occur when,

* Dedicated vs. shared resources. Shared resources require information
such as the VPHN hcall for CPU assignment to nodes. Associativity
decisions made based on dedicated resource rules, such as
associativity properties in the device tree, may vary from decisions
made using the values returned by the VPHN hcall.

* memoryless nodes at boot. Nodes need to be defined as 'possible' at
boot for operation with other code modules. Previously, the powerpc
code would limit the set of possible nodes to those which have
memory assigned at boot, and were thus online. Subsequent add/remove
of CPUs or memory would only work with this subset of possible
nodes.

* memoryless nodes with CPUs at boot. Due to the previous restriction
on nodes, nodes that had CPUs but no memory were being collapsed
into other nodes that did have memory at boot. In practice this
meant that the node assignment presented by the runtime kernel
differed from the affinity and associativity attributes presented by
the device tree or VPHN hcalls. Nodes that might be known to the
pHyp were not 'possible' in the runtime kernel because they did not
have memory at boot.

This patch ensures that sufficient nodes are defined to support
configuration requirements after boot, as well as at boot. This patch
set fixes a couple of problems.

* Nodes known to powerpc to be memoryless at boot, but to have CPUs in
them are allowed to be 'possible' and 'online'. Memory allocations
for those nodes are taken from another node that does have memory
until and if memory is hot-added to the node. * Nodes which have no
resources assigned at boot, but which may still be referenced
subsequently by affinity or associativity attributes, are kept in
the list of 'possible' nodes for powerpc. Hot-add of memory or CPUs
to the system can reference these nodes and bring them online
instead of redirecting to one of the set of nodes that were known to
have memory at boot.

This patch extracts the value of the lowest domain level (number of
allocable resources) from the device tree property
"ibm,max-associativity-domains" to use as the maximum number of nodes
to setup as possibly available in the system. This new setting will
override the instruction:

nodes_and(node_possible_map, node_possible_map, node_online_map);

presently seen in the function arch/powerpc/mm/numa.c:initmem_init().

If the "ibm,max-associativity-domains" property is not present at
boot, no operation will be performed to define or enable additional
nodes, or enable the above 'nodes_and()'.

Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 514a9cb3 01-Dec-2017 Nathan Fontenot <nfont@linux.vnet.ibm.com>

powerpc/numa: Update numa code use walk_drmem_lmbs

Update code in powerpc/numa.c to use the walk_drmem_lmbs()
routine instead of parsing the device tree directly. This is
in anticipation of introducing a new ibm,dynamic-memory-v2
property with a different format. This will allow the numa code
to use a single initialization routine per-LMB irregardless of
the device tree format.

Additionally, to support additional routines in numa.c that need
to look up LMB information, an late_init routine is added to drmem.c
to allocate the array of LMB information. This LMB array will provide
per-LMB information to separate the LMB data from the device tree
format.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# b88fc309 01-Dec-2017 Nathan Fontenot <nfont@linux.vnet.ibm.com>

powerpc/numa: Look up associativity array in of_drconf_to_nid_single

Look up the associativity arrays in of_drconf_to_nid_single when
deriving the nid for a LMB instead of having it passed in as a
parameter.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 22508f3d 01-Dec-2017 Nathan Fontenot <nfont@linux.vnet.ibm.com>

powerpc/numa: Look up device node in of_get_usable_memory()

Look up the device node for the usable memory property instead
of having it passed in as a parameter. This changes precedes an update
in which the calling routines for of_get_usable_memory() will not have
the device node pointer to pass in.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 35f80deb 01-Dec-2017 Nathan Fontenot <nfont@linux.vnet.ibm.com>

powerpc/numa: Look up device node in of_get_assoc_arrays()

Look up the device node for the associativity array property instead
of having it passed in as a parameter. This changes precedes an update
in which the calling routines for of_get_assoc_arrays() will not have
the device node pointer to pass in.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 8bc93149 08-Sep-2017 Michael Bringmann <mwb@linux.vnet.ibm.com>

powerpc/vphn: Fix numa update end-loop bug

powerpc/vphn: On Power systems with shared configurations of CPUs
and memory, there are some issues with the association of additional
CPUs and memory to nodes when hot-adding resources. This patch
fixes an end-of-updates processing problem observed occasionally
in numa_update_cpu_topology().

Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# cee5405d 08-Sep-2017 Michael Bringmann <mwb@linux.vnet.ibm.com>

powerpc/hotplug: Improve responsiveness of hotplug change

powerpc/hotplug: On Power systems with shared configurations of CPUs
and memory, there are some issues with the association of additional
CPUs and memory to nodes when hot-adding resources. During hotplug
CPU operations, this patch resets the timer on topology update work
function to a small value to better ensure that the CPU topology is
detected and configured sooner.

Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# a3496e91 08-Sep-2017 Michael Bringmann <mwb@linux.vnet.ibm.com>

powerpc/vphn: Improve recognition of PRRN/VPHN

powerpc/vphn: On Power systems with shared configurations of CPUs
and memory, there are some issues with the association of additional
CPUs and memory to nodes when hot-adding resources. This patch
updates the initialization checks to independently recognize PRRN
or VPHN support.

Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 17f444c0 08-Sep-2017 Michael Bringmann <mwb@linux.vnet.ibm.com>

powerpc/vphn: Update CPU topology when VPHN enabled

powerpc/vphn: On Power systems with shared configurations of CPUs
and memory, there are some issues with the association of additional
CPUs and memory to nodes when hot-adding resources. This patch
corrects the currently broken capability to set the topology for
shared CPUs in LPARs. At boot time for shared CPU lpars, the
topology for each CPU was being set to node zero. Now when
numa_update_cpu_topology() is called appropriately, the Virtual
Processor Home Node (VPHN) capabilities information provided by the
pHyp allows the appropriate node in the shared configuration to be
selected for the CPU.

Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 6b2c08f9 04-Oct-2017 Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>

powerpc: Don't call lockdep_assert_cpus_held() from arch_update_cpu_topology()

It turns out that not all paths calling arch_update_cpu_topology() hold
cpu_hotplug_lock, but that's OK because those paths can't race with
any concurrent hotplug events.

Warnings were reported with the following trace:

lockdep_assert_cpus_held
arch_update_cpu_topology
sched_init_domains
sched_init_smp
kernel_init_freeable
kernel_init
ret_from_kernel_thread

Which is safe because it's called early in boot when hotplug is not
live yet.

And also this trace:

lockdep_assert_cpus_held
arch_update_cpu_topology
partition_sched_domains
cpuset_update_active_cpus
sched_cpu_deactivate
cpuhp_invoke_callback
cpuhp_down_callbacks
cpuhp_thread_fun
smpboot_thread_fn
kthread
ret_from_kernel_thread

Which is safe because it's called as part of CPU hotplug, so although
we don't hold the CPU hotplug lock, there is another thread driving
the CPU hotplug operation which does hold the lock, and there is no
race.

Thanks to tglx for deciphering it for us.

Fixes: 3e401f7a2e51 ("powerpc: Only obtain cpu_hotplug_lock if called by rtasd")
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# df7e828c 04-Oct-2017 Kees Cook <keescook@chromium.org>

timer: Remove init_timer_deferrable() in favor of timer_setup()

This refactors the only users of init_timer_deferrable() to use
the new timer_setup() and from_timer(). Removes definition of
init_timer_deferrable().

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David S. Miller <davem@davemloft.net> # for networking parts
Acked-by: Sebastian Reichel <sre@kernel.org> # for drivers/hsi parts
Cc: linux-mips@linux-mips.org
Cc: Petr Mladek <pmladek@suse.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kalle Valo <kvalo@qca.qualcomm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: linux1394-devel@lists.sourceforge.net
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: linux-s390@vger.kernel.org
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: Wim Van Sebroeck <wim@iguana.be>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ursula Braun <ubraun@linux.vnet.ibm.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Harish Patil <harish.patil@cavium.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Manish Chopra <manish.chopra@cavium.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-pm@vger.kernel.org
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mark Gross <mark.gross@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linux-watchdog@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-wireless@vger.kernel.org
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Michael Reed <mdr@sgi.com>
Cc: netdev@vger.kernel.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Link: https://lkml.kernel.org/r/1507159627-127660-6-git-send-email-keescook@chromium.org


# 3e401f7a 20-Jun-2017 Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>

powerpc: Only obtain cpu_hotplug_lock if called by rtasd

Calling arch_update_cpu_topology from a CPU hotplug state machine callback
hits a deadlock because the function tries to get a read lock on
cpu_hotplug_lock while the state machine still holds a write lock on it.

Since all callers of arch_update_cpu_topology except rtasd already hold
cpu_hotplug_lock, this patch changes the function to use
stop_machine_cpuslocked and creates a separate function for rtasd which
still tries to obtain the lock.

Michael Bringmann investigated the bug and provided a detailed analysis
of the deadlock on this previous RFC for an alternate solution:

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Michael Bringmann <mwb@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1497996510-4032-1-git-send-email-bauerman@linux.vnet.ibm.com
Link: https://patchwork.ozlabs.org/patch/771293/


# ea614555 06-Apr-2017 Anshuman Khandual <khandual@linux.vnet.ibm.com>

powerpc/mm: Remove reduntant initmem information from log

Generic core VM already prints these information in the log
buffer, hence there is no need for a second print. This just
removes the second print from arch powerpc NUMA init path.

Before the patch:

$ dmesg | grep "Initmem"

numa: Initmem setup node 0 [mem 0x00000000-0xffffffff]
numa: Initmem setup node 1 [mem 0x100000000-0x1ffffffff]
numa: Initmem setup node 2 [mem 0x200000000-0x2ffffffff]
numa: Initmem setup node 3 [mem 0x300000000-0x3ffffffff]
numa: Initmem setup node 4 [mem 0x400000000-0x4ffffffff]
numa: Initmem setup node 5 [mem 0x500000000-0x5ffffffff]
numa: Initmem setup node 6 [mem 0x600000000-0x6ffffffff]
numa: Initmem setup node 7 [mem 0x700000000-0x7ffffffff]
Initmem setup node 0 [mem 0x0000000000000000-0x00000000ffffffff]
Initmem setup node 1 [mem 0x0000000100000000-0x00000001ffffffff]
Initmem setup node 2 [mem 0x0000000200000000-0x00000002ffffffff]
Initmem setup node 3 [mem 0x0000000300000000-0x00000003ffffffff]
Initmem setup node 4 [mem 0x0000000400000000-0x00000004ffffffff]
Initmem setup node 5 [mem 0x0000000500000000-0x00000005ffffffff]
Initmem setup node 6 [mem 0x0000000600000000-0x00000006ffffffff]
Initmem setup node 7 [mem 0x0000000700000000-0x00000007ffffffff]

After the patch just the latter set is printed.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# be9ba9ff 01-Feb-2017 Shailendra Singh <shailendras@nvidia.com>

powerpc: Drop GPL from of_node_to_nid() export to match other arches

The generic implementation of of_node_to_nid() is EXPORT_SYMBOL, added
in commit 298535c00a2c ("of, numa: Add NUMA of binding
implementation.").

The powerpc implementation added in commit 953039c8df7b ("[PATCH]
powerpc: Allow devices to register with numa topology") is
EXPORT_SYMBOL_GPL.

This creates an inconsistency for of_node_to_nid() callers across
architectures.

Update the powerpc implementation to be exported consistently with the
generic implementation.

Signed-off-by: Shailendra Singh <shailendras@nvidia.com>
Reviewed-by: Andy Ritger <aritger@nvidia.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 2a8628d4 16-Nov-2016 Reza Arbab <arbab@linux.vnet.ibm.com>

powerpc/mm: Allow memory hotplug into an offline node

Relax the check preventing us from hotplugging into an offline node.

This limitation was added in commit 482ec7c403d2 ("[PATCH] powerpc numa:
Support sparse online node map") to prevent adding resources to an
uninitialized node.

These days, there is no harm in doing so. The addition will actually
cause the node to be initialized and onlined; add_memory_resource()
calls hotadd_new_pgdat() (if necessary) and node_set_online().

Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 7656cd8e 13-Oct-2016 Reza Arbab <arbab@linux.vnet.ibm.com>

powerpc/mm: Simplify loop control in parse_numa_properties()

The flow of the main loop in parse_numa_properties() is overly
complicated. Simplify it to be less confusing and easier to read.
No functional change.

The end of the main loop in parse_numa_properties() looks like this:

for_each_node_by_type(...) {
...
if (!condition) {
if (--ranges)
goto new_range;
else
continue;
}

statement();

if (--ranges)
goto new_range;
/* else
* continue; <- implicit, this is the end of the loop
*/
}

The only effect of !condition is to skip execution of statement(). This
can be rewritten in a simpler way:

for_each_node_by_type(...) {
...
if (condition)
statement();

if (--ranges)
goto new_range;
}

Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 73c1b41e 21-Dec-2016 Thomas Gleixner <tglx@linutronix.de>

cpu/hotplug: Cleanup state names

When the state names got added a script was used to add the extra argument
to the calls. The script basically converted the state constant to a
string, but the cleanup to convert these strings into meaningful ones did
not happen.

Replace all the useless strings with 'subsys/xxx/yyy:state' strings which
are used in all the other places already.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20161221192112.085444152@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 4a3bac4e 12-Dec-2016 Reza Arbab <arbab@linux.vnet.ibm.com>

powerpc/mm: allow memory hotplug into a memoryless node

Patch series "enable movable nodes on non-x86 configs", v7.

This patchset allows more configs to make use of movable nodes. When
CONFIG_MOVABLE_NODE is selected, there are two ways to introduce such
nodes into the system:

1. Discover movable nodes at boot. Currently this is only possible on
x86, but we will enable configs supporting fdt to do the same.

2. Hotplug and online all of a node's memory using online_movable. This
is already possible on any config supporting memory hotplug, not
just x86, but the Kconfig doesn't say so. We will fix that.

We'll also remove some cruft on power which would prevent (2).

This patch (of 5):

Remove the check which prevents us from hotplugging into an empty node.

The original commit b226e4621245 ("[PATCH] powerpc: don't add memory to
empty node/zone"), states that this was intended to be a temporary measure.
It is a workaround for an oops which no longer occurs.

Link: http://lkml.kernel.org/r/1479160961-25840-2-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alistair Popple <apopple@au1.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8467801c 18-Oct-2016 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc: Fix numa topology console print

With recent update to printk, we get console output like below:

[ 0.550639] Brought up 160 CPUs
[ 0.550718] Node 0 CPUs:
[ 0.550721] 0
[ 0.550754] -39

[ 0.550794] Node 1 CPUs:
[ 0.550798] 40
[ 0.550817] -79

[ 0.550856] Node 16 CPUs:
[ 0.550860] 80
[ 0.550880] -119

[ 0.550917] Node 17 CPUs:
[ 0.550923] 120
[ 0.550942] -159

Fix this by properly using pr_cont(), ie. KERN_CONT.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 08b5e79e 12-Oct-2016 Michael Ellerman <mpe@ellerman.id.au>

powerpc/mm: Drop dump_numa_memory_topology()

At boot we dump the NUMA memory topology in dump_numa_memory_topology(),
at KERN_DEBUG level, resulting in output like:

Node 0 Memory: 0x0-0x100000000
Node 1 Memory: 0x100000000-0x200000000

Which is nice enough, but immediately after that we iterate over each
node and call setup_node_data(), which also prints out the node ranges,
at KERN_INFO, giving eg:

numa: Initmem setup node 0 [mem 0x00000000-0xffffffff]
numa: Initmem setup node 1 [mem 0x100000000-0x1ffffffff]

Additionally dump_numa_memory_topology() does not use KERN_CONT
correctly, resulting in split output lines on recent kernels.

So drop dump_numa_memory_topology() as superfluous chatter.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# bdab88e0 18-Jul-2016 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

powerpc/numa: Convert to hotplug state machine

Install the callbacks via the state machine. On the boot cpu the callback is
invoked manually because cpuhp is not up yet and everything must be
preinitialized before additional CPUs are up.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christophe Jaillet <christophe.jaillet@wanadoo.fr>
Cc: Anton Blanchard <anton@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160718140727.GA13132@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 45b64ee6 12-May-2016 Bharata B Rao <bharata@linux.vnet.ibm.com>

powerpc/numa: Fix multiple bugs in memory_hotplug_max()

memory_hotplug_max() uses hot_add_drconf_memory_max() to get maxmimum
addressable memory by referring to ibm,dyanamic-memory property. There
are three problems with the current approach:

1 hot_add_drconf_memory_max() assumes that ibm,dynamic-memory includes
all the LMBs of the guest, but that is not true for PowerKVM which
populates only DR LMBs (LMBs that can be hotplugged/removed) in that
property.
2 hot_add_drconf_memory_max() multiplies lmb-size with lmb-count to arrive
at the max possible address. Since ibm,dynamic-memory doesn't include
RMA LMBs, the address thus obtained will be less than the actual max
address. For example, if max possible memory size is 32G, with lmb-size
of 256MB there can be 127 LMBs in ibm,dynamic-memory (1 LMB for RMA
which won't be present here). hot_add_drconf_memory_max() would then
return the max addressable memory as 127 * 256MB = 31.75GB, the max
address should have been 32G which is what ibm,lrdr-capacity shows.
3 In PowerKVM, there can be a gap between the end of boot time RAM and
beginning of hotplug RAM area. So just multiplying lmb-count with
lmb-size will not provide the correct max possible address for PowerKVM.

This patch fixes 1 by using ibm,lrdr-capacity property to return the max
addressable memory whenever the property is present. Then it fixes 2 & 3
by fetching the address of the last LMB in ibm,dynamic-memory property.

Fixes: cd34206e949b ("powerpc: Add memory_hotplug_max()")
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# e70bd3ae 12-May-2016 Bharata B Rao <bharata@linux.vnet.ibm.com>

powerpc/numa: Fix whitespace in hot_add_drconf_memory_max()

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# c118baf8 05-Nov-2015 Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes

With the setup_nr_nodes(), we have already initialized
node_possible_map. So it is safe to use for_each_node here.

There are many places in the kernel that use hardcoded 'for' loop with
nr_node_ids, because all other architectures have numa nodes populated
serially. That should be reason we had maintained the same for
powerpc.

But, since sparse numa node ids possible on powerpc, we unnecessarily
allocate memory for non existent numa nodes.

For e.g., on a system with 0,1,16,17 as numa nodes nr_node_ids=18 and
we allocate memory for nodes 2-14. This patch we allocate memory for
only existing numa nodes.

The patch is boot tested on a 4 node tuleta, confirming with printks
that it works as expected.

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Anton Blanchard <anton@samba.org>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Greg Kurz <gkurz@linux.vnet.ibm.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1def3758 11-Oct-2015 Christophe Jaillet <christophe.jaillet@wanadoo.fr>

powerpc/numa: Use of_get_next_parent to simplify code

of_get_next_parent can be used to simplify the while() loop and
avoid the need of a temp variable.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 1d805440 01-Jul-2015 Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>

powerpc/numa: initialize distance lookup table from drconf path

In some situations, a NUMA guest that supports
ibm,dynamic-memory-reconfiguration node will end up having flat NUMA
distances between nodes. This is because of two problems in the
current code.

1) Different representations of associativity lists.

There is an assumption about the associativity list in
initialize_distance_lookup_table(). Associativity list has two forms:

a) [cpu,memory]@x/ibm,associativity has following
format:
<N> <N integers>

b) ibm,dynamic-reconfiguration-memory/ibm,associativity-lookup-arrays

<M> <N> <M associativity lists each having N integers>
M = the number of associativity lists
N = the number of entries per associativity list

Fix initialize_distance_lookup_table() so that it does not assume
"case a". And update the caller to skip the length field before
sending the associativity list.

2) Distance table not getting updated from drconf path.

Node distance table will not get initialized in certain cases as
ibm,dynamic-reconfiguration-memory path does not initialize the
lookup table.

Call initialize_distance_lookup_table() from drconf path with
appropriate associativity list.

Reported-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
Acked-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 3af229f2 10-Mar-2015 Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

powerpc/numa: Reset node_possible_map to only node_online_map

Raghu noticed an issue with excessive memory allocation on power with a
simple cgroup test, specifically, in mem_cgroup_css_alloc ->
for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing
up the kmalloc-2048 slab (to the order of 200MB for 400 cgroup
directories).

The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
possible), which defines node_possible_map, which in turn defines the
value of nr_node_ids in setup_nr_node_ids and the iteration of
for_each_node.

In practice, we never see a system with 256 NUMA nodes, and in fact, we
do not support node hotplug on power in the first place, so the nodes
that are online when we come up are the nodes that will be present for
the lifetime of this kernel. So let's, at least, drop the NUMA possible
map down to the online map at runtime. This is similar to what x86 does
in its initialization routines.

mem_cgroup_css_alloc should also be fixed to only iterate over
memory-populated nodes and handle hotplug, but that is a separate
change.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 4b6cfb2a 23-Feb-2015 Greg Kurz <groug@kaod.org>

powerpc/vphn: move VPHN parsing logic to a separate file

The goal behind this patch is to be able to write userland tests for the
VPHN parsing code.

Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# b1fc9484 23-Feb-2015 Greg Kurz <groug@kaod.org>

powerpc/vphn: move endianness fixing to vphn_unpack_associativity()

The first argument to vphn_unpack_associativity() is a const long *, but the
parsing code expects __be64 values actually. Let's move the endian fixing
down for consistency.

Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 9d0e6d9d 23-Feb-2015 Greg Kurz <groug@kaod.org>

powerpc/vphn: clarify the H_HOME_NODE_ASSOCIATIVITY API

The number of values returned by the H_HOME_NODE_ASSOCIATIVITY h_call deserves
to be explicitly defined, for a better understanding of the code.

Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# f5242e5a 24-Nov-2014 Grant Likely <grant.likely@linaro.org>

of/reconfig: Always use the same structure for notifiers

The OF_RECONFIG notifier callback uses a different structure depending
on whether it is a node change or a property change. This is silly, and
not very safe. Rework the code to use the same data structure regardless
of the type of notifier.

Signed-off-by: Grant Likely <grant.likely@linaro.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Pantelis Antoniou <pantelis.antoniou@konsulko.com>
Cc: <linuxppc-dev@lists.ozlabs.org>


# 21098b9e 17-Sep-2014 Anton Blanchard <anton@samba.org>

powerpc: Move sparse_init() into initmem_init

We did part of sparse initialisation in setup_arch and part in
initmem_init. Put them together.

Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 10239733 17-Sep-2014 Anton Blanchard <anton@samba.org>

powerpc: Remove bootmem allocator

At the moment we transition from the memblock alloctor to the bootmem
allocator. Gitting rid of the bootmem allocator removes a bunch of
complicated code (most of which I owe the dubious honour of being
responsible for writing).

Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 2c0a33f9 17-Oct-2014 Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

powerpc/numa: ensure per-cpu NUMA mappings are correct on topology update

We received a report of warning in kernel/sched/core.c where the sched
group was NULL on an LPAR after a topology update. This seems to occur
because after the topology update has moved the CPUs, cpu_to_node is
returning the old value still, which ends up breaking the consistency of
the NUMA topology in the per-cpu maps. Ensure that we update the per-cpu
fields when we re-map CPUs.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 49f8d8c0 17-Oct-2014 Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

powerpc/numa: use cached value of update->cpu in update_cpu_topology

There isn't any need to keep referring to update->cpu, as we've already
checked cpu == update->cpu at this point.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 5c9fb189 14-Oct-2014 Greg Kurz <groug@kaod.org>

powerpc/vphn: NUMA node code expects big-endian

The associativity domain numbers are obtained from the hypervisor through
registers and written into memory by the guest: the packed array passed to
vphn_unpack_associativity() is then native-endian, unlike what was assumed
in the following commit:

commit b08a2a12e44eaec5024b2b969f4fcb98169d1ca3
Author: Alistair Popple <alistair@popple.id.au>
Date: Wed Aug 7 02:01:44 2013 +1000

powerpc: Make NUMA device node code endian safe

This issue fills the topology with bogus data and makes it unusable. It may
lead to severe performance breakdowns.

We should ideally patch the vphn_unpack_associativity() function to do the
64-bit loads, but this requires some more brain storming.

In the meantime, let's go for a suboptimal and temporary bug fix: this patch
converts each 64-bit value of the packed array to big endian, as expected by
the current parsing code in vphn_unpack_associativity().

Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 2d73bae1 10-Oct-2014 Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

powerpc/numa: Add ability to disable and debug topology updates

We have hit a few customer issues with the topology update code (VPHN
and PRRN). It would be nice to be able to debug the notifications coming
from the hypervisor in both cases to the LPAR, as well as to disable
responding to the notifications at boot-time, to narrow down the source
of the problems. Add a basic level of such functionality, similar to the
numa= command-line parameter. We already have a toggle in
/proc/powerpc/topology_updates that allows run-time enabling/disabling,
so the updates can be started at run-time if desired. But the bugs we've
run into have occured during boot or very shortly after coming to login,
and have resulted in a broken NUMA topology.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 2d15b9b4 09-Oct-2014 Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

powerpc/numa: check error return from proc_create

proc_create can fail, we should check the return value and pass up the
failure.

Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 297cf502 27-Aug-2014 Li Zhong <zhong@linux.vnet.ibm.com>

powerpc: some changes in numa_setup_cpu()

this patches changes some error handling logics in numa_setup_cpu(),
when cpu node is not found, so:

if the cpu is possible, but not present, -1 is kept in numa_cpu_lookup_table,
so later, if the cpu is added, we could set correct numa information for it.

if the cpu is present, then we set the first online node to
numa_cpu_lookup_table instead of 0 ( in case 0 might not be an online node? )

Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# bc3c4327 27-Aug-2014 Li Zhong <zhong@linux.vnet.ibm.com>

powerpc: Only set numa node information for present cpus at boottime

As Nish suggested, it makes more sense to init the numa node informatiion
for present cpus at boottime, which could also avoid WARN_ON(1) in
numa_setup_cpu().

With this change, we also need to change the smp_prepare_cpus() to set up
numa information only on present cpus.

For those possible, but not present cpus, their numa information
will be set up after they are started, as the original code did before commit
2fabf084b6ad.

Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Cyril Bur <cyril.bur@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 70ad2375 27-Aug-2014 Li Zhong <zhong@linux.vnet.ibm.com>

powerpc: Fix warning reported by verify_cpu_node_mapping()

With commit 2fabf084b6ad ("powerpc: reorder per-cpu NUMA information's
initialization"), during boottime, cpu_numa_callback() is called
earlier(before their online) for each cpu, and verify_cpu_node_mapping()
uses cpu_to_node() to check whether siblings are in the same node.

It skips the checking for siblings that are not online yet. So the only
check done here is for the bootcpu, which is online at that time. But
the per-cpu numa_node cpu_to_node() uses hasn't been set up yet (which
will be set up in smp_prepare_cpus()).

So I saw something like following reported:
[ 0.000000] CPU thread siblings 1/2/3 and 0 don't belong to the same
node!

As we don't actually do the checking during this early stage, so maybe
we could directly call numa_setup_cpu() in do_init_bootmem().

Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 6db35ad2 18-Sep-2014 Scott Wood <scottwood@freescale.com>

powerpc/mm: Use common paging_init() for NUMA

Commit 1c98025c6c95bc057a25e2c6596de23288c68160 "powerpc: Dynamic DMA
zone limits" updated how zones are created in paging_init(), but missed
the NUMA version of paging_init(). This was noticed via a linker
error, since dma_pfn_limit_to_zone() was, like the non-NUMA
paging_init(), limited by #ifndef CONFIG_NEED_MULTIPLE_NODES.

It turns out that the NUMA paging_init() was not actually doing
anything different from the standard paging_init(), other than a couple
debug prints, a couple 32-bit-only ifdef sections, and a call to
mark_nonram_nosave(). It's not clear whether mark_nonram_nosave() is
inherently wrong to do for NUMA, or just not useful on targets that
have NUMA, but for now I'm preserving the existing behavior.

Fixes: 1c98025c6c9 "powerpc: Dynamic DMA zone limits"
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Scott Wood <scottwood@freescale.com>


# 2fabf084 17-Jul-2014 Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

powerpc: reorder per-cpu NUMA information's initialization

There is an issue currently where NUMA information is used on powerpc
(and possibly ia64) before it has been read from the device-tree, which
leads to large slab consumption with CONFIG_SLUB and memoryless nodes.

NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
after start_secondary(), similar to ia64, which is invoked via
smp_init().

Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as
early_initcall()") made init_workqueues() be invoked via
do_pre_smp_initcalls(), which is obviously before the secondary
processors are online.

Additionally, the following commits changed init_workqueues() to use
cpu_to_node to determine the node to use for kthread_create_on_node:

bce903809ab3f ("workqueue: add wq_numa_tbl_len and
wq_numa_possible_cpumask[]")
f3f90ad469342 ("workqueue: determine NUMA node of workers accourding to
the allowed cpumask")

Therefore, when init_workqueues() runs, it sees all CPUs as being on
Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
a high number of slab deactivations
(http://www.spinics.net/lists/linux-mm/msg67489.html).

Fix this by initializing the powerpc-specific CPU<->node/local memory
node mapping as early as possible, which on powerpc is
do_init_bootmem(). Currently that function initializes the mapping for
the boot CPU, but we extend it to setup the mapping for all possible
CPUs. Then, in smp_prepare_cpus(), we can correspondingly set the
per-cpu values for all possible CPUs. That ensures that before the
early_initcalls run (and really as early as possible), the per-cpu NUMA
mapping is accurate.

While testing memoryless nodes on PowerKVM guests with a fix to the
workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a
guest topology of:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 16336 MB
node 1 free: 15329 MB
node distances:
node 0 1
0: 10 40
1: 40 10

the slab consumption decreases from

Slab: 932416 kB
SUnreclaim: 902336 kB

to

Slab: 395264 kB
SUnreclaim: 359424 kB

And we a corresponding increase in the slab efficiency from

slab mem objs slabs
used active active
------------------------------------------------------------
kmalloc-16384 337 MB 11.28% 100.00%
task_struct 288 MB 9.93% 100.00%

to

slab mem objs slabs
used active active
------------------------------------------------------------
kmalloc-16384 37 MB 100.00% 100.00%
task_struct 31 MB 100.00% 100.00%

Powerpc didn't support memoryless nodes until recently (64bb80d87f01
"powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c272261194d
"powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also
helped improve memory consumption with these kind of environments.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# b00fc6ec 04-Aug-2014 Andrey Utkin <andrey.krieger.utkin@gmail.com>

powerpc/mm/numa: Fix break placement

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81631
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 12c743eb 18-Apr-2014 Mike Qiu <qiudayu@linux.vnet.ibm.com>

powerpc/mm: fix ".__node_distance" undefined

CHK include/config/kernel.release
CHK include/generated/uapi/linux/version.h
CHK include/generated/utsrelease.h
...
Building modules, stage 2.
WARNING: 1 bad relocations
c0000000013d6a30 R_PPC64_ADDR64 uprobes_fetch_type_table
WRAP arch/powerpc/boot/zImage.pseries
WRAP arch/powerpc/boot/zImage.epapr
MODPOST 1849 modules
ERROR: ".__node_distance" [drivers/block/nvme.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2
make: *** Waiting for unfinished jobs....

The reason is symbol "__node_distance" not been exported in powerpc.

Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9a013361 07-Apr-2014 Michael Wang <wangyun@linux.vnet.ibm.com>

power, sched: stop updating inside arch_update_cpu_topology() when nothing to be update

Since v1:
Edited the comment according to Srivatsa's suggestion.

During the testing, we encounter below WARN followed by Oops:

WARNING: at kernel/sched/core.c:6218
...
NIP [c000000000101660] .build_sched_domains+0x11d0/0x1200
LR [c000000000101358] .build_sched_domains+0xec8/0x1200
PACATMSCRATCH [800000000000f032]
Call Trace:
[c00000001b103850] [c000000000101358] .build_sched_domains+0xec8/0x1200
[c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
[c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
[c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
...
Oops: Kernel access of bad area, sig: 11 [#1]
...
NIP [c00000000045c000] .__bitmap_weight+0x60/0xf0
LR [c00000000010132c] .build_sched_domains+0xe9c/0x1200
PACATMSCRATCH [8000000000029032]
Call Trace:
[c00000001b1037a0] [c000000000288ff4] .kmem_cache_alloc_node_trace+0x184/0x3a0
[c00000001b103850] [c00000000010132c] .build_sched_domains+0xe9c/0x1200
[c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
[c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
[c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
...

This was caused by that 'sd->groups == NULL' after building groups, which
was caused by the empty 'sd->span'.

The cpu's domain contained nothing because the cpu was assigned to a wrong
node, due to the following unfortunate sequence of events:

1. The hypervisor sent a topology update to the guest OS, to notify changes
to the cpu-node mapping. However, the update was actually redundant - i.e.,
the "new" mapping was exactly the same as the old one.

2. Due to this, the 'updated_cpus' mask turned out to be empty after exiting
the 'for-loop' in arch_update_cpu_topology().

3. So we ended up calling stop-machine() with an empty cpumask list, which made
stop-machine internally elect cpumask_first(cpu_online_mask), i.e., CPU0 as
the cpu to run the payload (the update_cpu_topology() function).

4. This causes update_cpu_topology() to be run by CPU0. And since 'updates'
is kzalloc()'ed inside arch_update_cpu_topology(), update_cpu_topology()
finds update->cpu as well as update->new_nid to be 0. In other words, we
end up assigning CPU0 (and eventually its siblings) to node 0, incorrectly.

Along with the following wrong updating, it causes the sched-domain rebuild
code to break and crash the system.

Fix this by skipping the topology update in cases where we find that
the topology has not actually changed in reality (ie., spurious updates).

CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Nathan Fontenot <nfont@linux.vnet.ibm.com>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Robert Jennings <rcj@linux.vnet.ibm.com>
CC: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
CC: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
CC: Alistair Popple <alistair@popple.id.au>
Suggested-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 316d7188 28-Jan-2014 Joe Perches <joe@perches.com>

powerpc/numa: Fix decimal permissions

This should have been octal.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# e7e8de59 21-Jan-2014 Tang Chen <tangchen@cn.fujitsu.com>

memblock: make memblock_set_node() support different memblock_type

[sfr@canb.auug.org.au: fix powerpc build]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Chen Tang <imtangchen@gmail.com>
Cc: Gong Chen <gong.chen@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Liu Jiang <jiang.liu@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 68fb18aa 30-Dec-2013 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

powerpc: Add debug checks to catch invalid cpu-to-node mappings

There have been some weird bugs in the past where the kernel tried to associate
threads of the same core to different NUMA nodes, and things went haywire after
that point (as expected).

But unfortunately, root-causing such issues have been quite challenging, due to
the lack of appropriate debug checks in the kernel. These bugs usually lead to
some odd soft-lockups in the scheduler's build-sched-domain code in the CPU
hotplug path, which makes it very hard to trace it back to the incorrect
cpu-to-node mappings.

So add appropriate debug checks to catch such invalid cpu-to-node mappings
as early as possible.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# d4edc5b6 30-Dec-2013 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

powerpc: Fix the setup of CPU-to-Node mappings during CPU online

On POWER platforms, the hypervisor can notify the guest kernel about dynamic
changes in the cpu-numa associativity (VPHN topology update). Hence the
cpu-to-node mappings that we got from the firmware during boot, may no longer
be valid after such updates. This is handled using the arch_update_cpu_topology()
hook in the scheduler, and the sched-domains are rebuilt according to the new
mappings.

But unfortunately, at the moment, CPU hotplug ignores these updated mappings
and instead queries the firmware for the cpu-to-numa relationships and uses
them during CPU online. So the kernel can end up assigning wrong NUMA nodes
to CPUs during subsequent CPU hotplug online operations (after booting).

Further, a particularly problematic scenario can result from this bug:
On POWER platforms, the SMT mode can be switched between 1, 2, 4 (and even 8)
threads per core. The switch to Single-Threaded (ST) mode is performed by
offlining all except the first CPU thread in each core. Switching back to
SMT mode involves onlining those other threads back, in each core.

Now consider this scenario:

1. During boot, the kernel gets the cpu-to-node mappings from the firmware
and assigns the CPUs to NUMA nodes appropriately, during CPU online.

2. Later on, the hypervisor updates the cpu-to-node mappings dynamically and
communicates this update to the kernel. The kernel in turn updates its
cpu-to-node associations and rebuilds its sched domains. Everything is
fine so far.

3. Now, the user switches the machine from SMT to ST mode (say, by running
ppc64_cpu --smt=1). This involves offlining all except 1 thread in each
core.

4. The user then tries to switch back from ST to SMT mode (say, by running
ppc64_cpu --smt=4), and this involves onlining those threads back. Since
CPU hotplug ignores the new mappings, it queries the firmware and tries to
associate the newly onlined sibling threads to the old NUMA nodes. This
results in sibling threads within the same core getting associated with
different NUMA nodes, which is incorrect.

The scheduler's build-sched-domains code gets thoroughly confused with this
and enters an infinite loop and causes soft-lockups, as explained in detail
in commit 3be7db6ab (powerpc: VPHN topology change updates all siblings).

So to fix this, use the numa_cpu_lookup_table to remember the updated
cpu-to-node mappings, and use them during CPU hotplug online operations.
Further, we also need to ensure that all threads in a core are assigned to a
common NUMA node, irrespective of whether all those threads were online during
the topology update. To achieve this, we take care not to use cpu_sibling_mask()
since it is not hotplug invariant. Instead, we use cpu_first_sibling_thread()
and set up the mappings manually using the 'threads_per_core' value for that
particular platform. This helps us ensure that we don't hit this bug with any
combination of CPU hotplug and SMT mode switching.

Cc: stable@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 6408068e 12-Nov-2013 Xishi Qiu <qiuxishi@huawei.com>

mm: use pgdat_end_pfn() to simplify the code in arch

Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn +
pgdat->node_spanned_pages". Simplify the code, no functional change.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ec32dd66 28-Oct-2013 Robert Jennings <rcj@linux.vnet.ibm.com>

powerpc: Fix warnings for arch/powerpc/mm/numa.c

Simple fixes for sparse warnings in this file.

Resolves:
arch/powerpc/mm/numa.c:198:24:
warning: Using plain integer as NULL pointer

arch/powerpc/mm/numa.c:1157:5:
warning: symbol 'hot_add_node_scn_to_nid' was not declared.
Should it be static?

arch/powerpc/mm/numa.c:1238:28:
warning: Using plain integer as NULL pointer

arch/powerpc/mm/numa.c:1538:6:
warning: symbol 'topology_schedule_update' was not declared.
Should it be static?

Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# b08a2a12 06-Aug-2013 Alistair Popple <alistair@popple.id.au>

powerpc: Make NUMA device node code endian safe

The device tree is big endian so make sure we byteswap on little
endian. We assume any pHyp calls also return big endian results in
memory.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# f13c13a0 06-Aug-2013 Anton Blanchard <anton@samba.org>

powerpc: Stop using non-architected shared_proc field in lppaca

Although the shared_proc field in the lppaca works today, it is
not architected. A shared processor partition will always have a non
zero yield_count so use that instead. Create a wrapper so users
don't have to know about the details.

In order for older kernels to continue to work on KVM we need
to set the shared_proc bit. While here, remove the ugly bitfield.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 3be7db6a 24-Jul-2013 Robert Jennings <rcj@linux.vnet.ibm.com>

powerpc: VPHN topology change updates all siblings

When an associativity level change is found for one thread, the
siblings threads need to be updated as well. This is done today
for PRRN in stage_topology_update() but is missing for VPHN in
update_cpu_associativity_changes_mask(). This patch will correctly
update all thread siblings during a topology change.

Without this patch a topology update can result in a CPU in
init_sched_groups_power() getting stuck indefinitely in a loop.

This loop is built in build_sched_groups(). As a result of the thread
moving to a node separate from its siblings the struct sched_group will
have its next pointer set to point to itself rather than the sched_group
struct of the next thread. This happens because we have a domain without
the SD_OVERLAP flag, which is correct, and a topology that doesn't conform
with reality (threads on the same core assigned to different numa nodes).
When this list is traversed by init_sched_groups_power() it will reach
the thread's sched_group structure and loop indefinitely; the cpu will
be stuck at this point.

The bug was exposed when VPHN was enabled in commit b7abef0 (v3.9).

Cc: <stable@vger.kernel.org> [v3.9+]
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# dd023217 24-Jun-2013 Nathan Fontenot <nfont@linux.vnet.ibm.com>

powerpc/numa: Do not update sysfs cpu registration from invalid context

The topology update code that updates the cpu node registration in sysfs
should not be called while in stop_machine(). The register/unregister
calls take a lock and may sleep.

This patch moves these calls outside of the call to stop_machine().

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 061d19f2 24-Jun-2013 Paul Gortmaker <paul.gortmaker@windriver.com>

powerpc: Delete __cpuinit usage from all users

The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications. For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out. Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.

This removes all the powerpc uses of the __cpuinit macros. There
are no __CPUINIT users in assembly files in powerpc.

[1] https://lkml.org/lkml/2013/5/20/589

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Josh Boyer <jwboyer@gmail.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 601abdc3 28-Apr-2013 Nathan Fontenot <nfont@linux.vnet.ibm.com>

powerpc/pseries: Correct builds break when CONFIG_SMP not defined

Correct build failure for powerpc/pseries builds with CONFIG_SMP not defined.

The function cpu_sibling_mask has no meaning (or definition) when CONFIG_SMP
is not defined. Additionally, the updating of NUMA affinity for a CPU in a UP
system doesn't really make sense.

This patch ifdef's out the code making the affinity updates for PRRN events to
fix the following build break.

arch/powerpc/mm/numa.c: In function ‘stage_topology_update’:
arch/powerpc/mm/numa.c:1535: error: implicit declaration of function ‘cpu_sibling_mask’
arch/powerpc/mm/numa.c:1535: warning: passing argument 3 of ‘cpumask_or’ makes pointer from integer without a cast
make[1]: *** [arch/powerpc/mm/numa.o] Error 1

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 9f3a90e8 28-Apr-2013 Stephen Rothwell <sfr@canb.auug.org.au>

powerpc: Fix build failure after merge of the cgroup tree

After merging the cgroup tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:

arch/powerpc/mm/numa.c: In function 'arch_update_cpu_topology':
arch/powerpc/mm/numa.c:1465:2: error: implicit declaration of function 'kzalloc' [-Werror=implicit-function-declaration]
arch/powerpc/mm/numa.c:1465:10: error: assignment makes pointer from integer without a cast [-Werror]
arch/powerpc/mm/numa.c:1497:2: error: implicit declaration of function 'kfree' [-Werror=implicit-function-declaration]

Caused by commit 30c05350c39d ("powerpc/pseries: Use stop machine to
update cpu maps") from the powerpc tree interacting with (probably)
commit ff794dea52ea ("cpuset: remove include of cgroup.h from cpuset.h")
from the cgroup tree. Removing includes from header files is fraught
with danger ...

The former should have added an include of linux/slab.h to
arch/powerpc/mm/numa.c.

I have added the following merge fix patch for today (but it should be
applied to the powerpc tree ASAP).

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 29 Apr 2013 14:01:44 +1000
Subject: [PATCH] powerpc: numa.c: using kzalloc/kfree requires including
slab.h

fixes these build errors:

arch/powerpc/mm/numa.c: In function 'arch_update_cpu_topology':
arch/powerpc/mm/numa.c:1465:2: error: implicit declaration of function 'kzalloc' [-Werror=implicit-function-declaration]
arch/powerpc/mm/numa.c:1465:10: error: assignment makes pointer from integer without a cast [-Werror]
arch/powerpc/mm/numa.c:1497:2: error: implicit declaration of function 'kfree' [-Werror=implicit-function-declaration]

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# f9d531b8 29-Apr-2013 Cody P Schafer <cody@linux.vnet.ibm.com>

powerpc/mm/numa: use setup_nr_node_ids() instead of opencoding.

[sfr@canb.auug.org.au: add missing semicolon]
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e04fa612 24-Apr-2013 Nathan Fontenot <nfont@linux.vnet.ibm.com>

powerpc/pseries: Add /proc interface to control topology updates

There are instances in which we do not want topology updates to occur.
In order to allow this a /proc interface (/proc/powerpc/topology_updates)
is introduced so that topology updates can be enabled and disabled.

This patch also adds a prrn_is_enabled() call so that PRRN events are
handled in the kernel only if topology updating is enabled.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# b7abef04 24-Apr-2013 Jesse Larrew <jlarrew@linux.vnet.ibm.com>

powerpc/pseries: RE-enable Virtual Processor Home Node updating

The new PRRN firmware feature provides a more convenient and event-driven
interface than VPHN for notifying Linux of changes to the NUMA affinity of
platform resources. However, for practical reasons, it may not be feasible
for some customers to update to the latest firmware. For these customers,
the VPHN feature supported on previous firmware versions may still be the
best option.

The VPHN feature was previously disabled due to races with the load
balancing code when accessing the NUMA cpu maps, but the new stop_machine()
approach protects the NUMA cpu maps from these concurrent accesses. It
should be safe to re-enable this feature now.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 176bbf14 24-Apr-2013 Jesse Larrew <jlarrew@linux.vnet.ibm.com>

powerpc/pseries: Update NUMA VDSO information when updating CPU maps

The following patch adds vdso_getcpu_init(), which stores the NUMA node for
a cpu in SPRG3:

Commit 18ad51dd34 ("powerpc: Add VDSO version of getcpu") adds
vdso_getcpu_init(), which stores the NUMA node for a cpu in SPRG3.

This patch ensures that this information is also updated when the NUMA
affinity of a cpu changes.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 30c05350 24-Apr-2013 Nathan Fontenot <nfont@linux.vnet.ibm.com>

powerpc/pseries: Use stop machine to update cpu maps

The new PRRN firmware feature allows CPU and memory resources to be
transparently reassigned across NUMA boundaries. When this happens, the
kernel must update the node maps to reflect the new affinity information.

Although the NUMA maps can be protected by locking primitives during the
update itself, this is insufficient to prevent concurrent accesses to these
structures. Since cpumask_of_node() hands out a pointer to these
structures, they can still be modified outside of the lock. Furthermore,
tracking down each usage of these pointers and adding locks would be quite
invasive and difficult to maintain.

The approach used is to make a list of affected cpus and call stop_machine
to have the update routine run on each of the affected cpus allowing them
to update themselves. Each cpu finds itself in the list of cpus and makes
the appropriate updates. We need to have each cpu do this for themselves to
handle calls to vdso_getcpu_init() added in a subsequent patch.

Situations like these are best handled using stop_machine(). Since the NUMA
affinity updates are exceptionally rare events, this approach has the
benefit of not adding any overhead while accessing the NUMA maps during
normal operation.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 5d88aa85 24-Apr-2013 Jesse Larrew <jlarrew@linux.vnet.ibm.com>

powerpc/pseries: Update CPU maps when device tree is updated

Platform events such as partition migration or the new PRRN firmware
feature can cause the NUMA characteristics of a CPU to change, and these
changes will be reflected in the device tree nodes for the affected
CPUs.

This patch registers a handler for Open Firmware device tree updates
and reconfigures the CPU and node maps whenever the associativity
changes. Currently, this is accomplished by marking the affected CPUs in
the cpu_associativity_changes_mask and allowing
arch_update_cpu_topology() to retrieve the new associativity information
using hcall_vphn().

Protecting the NUMA cpu maps from concurrent access during an update
operation will be addressed in a subsequent patch in this series.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 8002b0c5 23-Apr-2013 Nathan Fontenot <nfont@linux.vnet.ibm.com>

powerpc/pseries: Update numa.c to use updated firmware_has_feature()

Update the numa code to use the updated firmware_has_feature() when checking
for type 1 affinity.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 55671f3c 25-Mar-2013 Stephen Rothwell <sfr@canb.auug.org.au>

powerpc: fix annotation of fake_numa_create_new_node()

This function has always been marked as __cpuinit, but is only called
from functions marked as __init and references an __initdata variable.
So change its annotation to __init.

Fixes this build warning:

WARNING: arch/powerpc/mm/built-in.o(.cpuinit.text+0x86): Section mismatch in reference from the function .fake_numa_create_new_node() to the variable .init.data:cmdline
The function __cpuinit .fake_numa_create_new_node() references
a variable __initdata cmdline.
If cmdline is only used by .fake_numa_create_new_node then
annotate cmdline with a matching annotation.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>


# 7122beee 21-Mar-2013 Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>

powerpc: fix numa distance for form0 device tree

The following commit breaks numa distance setup for old powerpc
systems that use form0 encoding in device tree.

commit 41eab6f88f24124df89e38067b3766b7bef06ddb
powerpc/numa: Use form 1 affinity to setup node distance

Device tree node /rtas/ibm,associativity-reference-points would
index into /cpus/PowerPCxxxx/ibm,associativity based on form0 or
form1 encoding detected by ibm,architecture-vec-5 property.

All modern systems use form1 and current kernel code is correct.
However, on older systems with form0 encoding, the numa distance
will get hard coded as LOCAL_DISTANCE for all nodes. This causes
task scheduling anomaly since scheduler will skip building numa
level domain (topmost domain with all cpus) if all numa distances
are same. (value of 'level' in sched_init_numa() will remain 0)

Prior to the above commit:
((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)

Restoring compatible behavior with this patch for old powerpc systems
with device tree where numa distance are encoded as form0.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>


# f5949720 02-Oct-2012 Nathan Fontenot <nfont@linux.vnet.ibm.com>

powerpc+of: Move of_drconf_cell struct definition to asm/prom.h

This patch moves the definition of the of_drconf_cell struct to asm/prom.h
to make it available for all powerpc/pseries code.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Acked-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 79c5fceb 07-Jun-2012 Jesse Larrew <jlarrew@linux.vnet.ibm.com>

powerpc/vphn: Fix arch_update_cpu_topology() return value

arch_update_cpu_topology() should only return 1 when the topology has
actually changed, and should return 0 otherwise.

This patch fixes a potential bug where rebuild_sched_domains() would
reinitialize the sched domains even when the topology hasn't changed.

Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# aa709f3b 05-Jul-2012 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/numa: Avoid stupid uninitialized warning from gcc

Newer gcc are being a bit blind here (it's pretty obvious we don't
reach the code path using the array if we haven't initialized the
pointer) but none of that is performance critical so let's just
silence it.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 5b958a7e 30-May-2012 Gavin Shan <shangw@linux.vnet.ibm.com>

powerpc/numa: Fix OF node refcounting bug

The form affinity for NUMA is set to 1 if the firmware supports
OPAL. Otherwise, we have to retrieve that from OF node "/chosen".
For the latter case, OF node "/chosen" reference count was never
decreased.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 82b2521d 19-Jun-2012 Michael Neuling <mikey@neuling.org>

powerpc: Fix uninitialised error in numa.c

chroma_defconfig currently gives me this with gcc 4.6:
arch/powerpc/mm/numa.c:638:13: error: 'dm' may be used uninitialized in this function [-Werror=uninitialized]

It's a bogus warning/error since of_get_drconf_memory() only writes it
anyway.

Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: <stable@kernel.org> [v3.3+]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# ae3a197e 28-Mar-2012 David Howells <dhowells@redhat.com>

Disintegrate asm/system.h for PowerPC

Disintegrate asm/system.h for PowerPC.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
cc: linuxppc-dev@lists.ozlabs.org


# 9512938b 12-Jan-2012 Wanlong Gao <gaowanlong@cn.fujitsu.com>

cpumask: update setup_node_to_cpumask_map() comments

node_to_cpumask() has been replaced by cpumask_of_node(), and wholly
removed since commit 29c337a0 ("cpumask: remove obsolete node_to_cpumask
now everyone uses cpumask_of_node").

So update the comments for setup_node_to_cpumask_map().

Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8a25a2fd 21-Dec-2011 Kay Sievers <kay.sievers@vrfy.org>

cpu: convert 'cpu' and 'machinecheck' sysdev_class to a regular subsystem

This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
and converts the devices to regular devices. The sysdev drivers are
implemented as subsystem interfaces now.

After all sysdev classes are ported to regular driver core entities, the
sysdev implementation will be entirely removed from the kernel.

Userspace relies on events and generic sysfs subsystem infrastructure
from sysdev devices, which are made available with this conversion.

Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@amd64.org>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Len Brown <lenb@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# 2011b1d0 07-Dec-2011 David Rientjes <rientjes@google.com>

powerpc/mm: Fix section mismatch for read_n_cells

read_n_cells() cannot be marked as .devinit.text since it is referenced
from two functions that are not in that section: of_get_lmb_size() and
hot_add_drconf_scn_to_nid().

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 28e86bdb 07-Dec-2011 David Rientjes <rientjes@google.com>

powerpc/mm: Fix section mismatch for mark_reserved_regions_for_nid

mark_reserved_regions_for_nid() is only called from do_init_bootmem(),
which is in .init.text, so it must be in the same section to avoid a
section mismatch warning.

Reported-by: Subrata Modak <subrata@linux.vnet.ibm.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 1d7cfe18 08-Dec-2011 Tejun Heo <tj@kernel.org>

powerpc: Use HAVE_MEMBLOCK_NODE_MAP

powerpc doesn't access early_node_map[] directly and enabling
HAVE_MEMBLOCK_NODE_MAP is trivial - replacing add_active_range() calls
with memblock_set_node() and selecting HAVE_MEMBLOCK_NODE_MAP is
enough.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>


# 42b2aa86 28-Nov-2011 Justin P. Mattock <justinmattock@gmail.com>

treewide: Fix typos in various parts of the kernel, and fix some comments.

The below patch fixes some typos in various parts of the kernel, as well as fixes some comments.
Please let me know if I missed anything, and I will try to get it changed and resent.

Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 1c8ee733 27-Oct-2011 Dipankar Sarma <dipankar@in.ibm.com>

powerpc/numa: NUMA topology support for PowerNV

This patch adds support for numa topology on powernv platforms running
OPAL formware. It checks for the type of platform at run time and
sets the affinity form correctly so that NUMA topology can be discovered
correctly.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 4b16f8e2 22-Jul-2011 Paul Gortmaker <paul.gortmaker@windriver.com>

powerpc: various straight conversions from module.h --> export.h

All these files were including module.h just for the basic
EXPORT_SYMBOL infrastructure. We can shift them off to the
export.h header which is a way smaller footprint and thus
realize some compile time gains.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# dfbe93a2 10-Aug-2011 Anton Blanchard <anton@samba.org>

powerpc: Coding style cleanups

While converting code to use for_each_node_by_type I noticed a
number of coding style issues.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 94db7c5e 10-Aug-2011 Anton Blanchard <anton@samba.org>

powerpc: Use for_each_node_by_type instead of open coding it

Use for_each_node_by_type instead of open coding it.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 60831842 10-Aug-2011 Anton Blanchard <anton@samba.org>

powerpc/numa: Remove double of_node_put in hot_add_node_scn_to_nid

During memory hotplug testing, I got the following warning:

ERROR: Bad of_node_put() on /memory@0

of_node_release
kref_put
of_node_put
of_find_node_by_type
hot_add_node_scn_to_nid
hot_add_scn_to_nid
memory_add_physaddr_to_nid
...

of_find_node_by_type() loop does the of_node_put for us so we only
need the handle the case where we terminate the loop early.

As suggested by Stephen Rothwell we can do the of_node_put
unconditionally outside of the loop since of_node_put handles a
NULL argument fine.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 5dfe8660 14-Jul-2011 Tejun Heo <tj@kernel.org>

bootmem: Replace work_with_active_regions() with for_each_mem_pfn_range()

Callback based iteration is cumbersome and much less useful than
for_each_*() iterator. This patch implements for_each_mem_pfn_range()
which replaces work_with_active_regions(). All the current users of
work_with_active_regions() are converted.

This simplifies walking over early_node_map and will allow converting
internal logics in page_alloc to use iterator instead of walking
early_node_map directly, which in turn will enable moving node
information to memblock.

powerpc change is only compile tested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20110714074610.GD3455@htj.dyndns.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 104699c0 27-Apr-2011 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

powerpc: Convert old cpumask API into new one

Adapt new API.

Almost change is trivial. Most important change is the below line
because we plan to change task->cpus_allowed implementation.

- ctx->cpus_allowed = current->cpus_allowed;

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# e70606eb 10-Apr-2011 Michael Ellerman <michael@ellerman.id.au>

powerpc/numa: Look for ibm, associativity-reference-points at the root

If we don't find ibm,associativity-reference-points as a child of
/rtas, look for it at the root of the tree instead. We use this on
Book3E where we have no RTAS but still use the sPAPR conventions
for NUMA.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 25985edc 30-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi>

Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>


# 36e8695c 09-Mar-2011 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/pseries: Disable VPNH feature

This feature triggers nasty races in the scheduler between the
rebuilding of the topology and the load balancing code, causing
the machine to hang.

Disable it for now until the races are fixed.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 429f4d8d 28-Jan-2011 Anton Blanchard <anton@samba.org>

powerpc/numa: Fix bug in unmap_cpu_from_node

When converting to the new cpumask code I screwed up:

- if (cpu_isset(cpu, numa_cpumask_lookup_table[node])) {
- cpu_clear(cpu, numa_cpumask_lookup_table[node]);
+ if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
+ cpumask_set_cpu(cpu, node_to_cpumask_map[node]);

This was introduced in commit 25863de07af9 (powerpc/cpumask: Convert NUMA code
to new cpumask API)

Fix it.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# fe5cfd63 28-Jan-2011 Anton Blanchard <anton@samba.org>

powerpc/numa: Disable VPHN on dedicated processor partitions

There is no need to start up the timer and monitor topology changes on a
dedicated processor partition, so disable it.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# c0e5e46f 28-Jan-2011 Anton Blanchard <anton@samba.org>

powerpc/numa: Add length when creating OF properties via VPHN

The rest of the NUMA code expects an OF associativity property with
the first cell containing the length. Without this fix all topology changes
cause us to misparse the property and put the cpu into node 0.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# d69043e8 28-Jan-2011 Anton Blanchard <anton@samba.org>

powerpc/numa: Check for all VPHN changes

The hypervisor uses unsigned 1 byte counters to signal topology changes to
the OS. Since they can wrap we need to check for any difference, not just if
the hypervisor count is greater than the previous count.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 5de16699 28-Jan-2011 Anton Blanchard <anton@samba.org>

powerpc/numa: Only use active VPHN count fields

VPHN supports up to 8 distance fields but the number of entries in
ibm,associativity-reference-points signifies how many are in use.
Don't look at all the VPHN counts, only distance_ref_points_depth
worth.

Since we already cap our distance metrics at MAX_DISTANCE_REF_POINTS,
use that to size the VPHN arrays and add a BUILD_BUG_ON to avoid it growing
larger than the VPHN maximum of 8.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# cd9d6cc7 20-Jan-2011 Jesse Larrew <jlarrew@linux.vnet.ibm.com>

powerpc/pseries: Remove unnecessary variable initializations in numa.c

Remove unnecessary variable initializations in VPHN functions.

Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 7639adaa 20-Jan-2011 Jesse Larrew <jlarrew@linux.vnet.ibm.com>

powerpc/pseries: Fix brace placement in numa.c

Fix brace placement in VPHN code.

Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# bd03403a 20-Jan-2011 Jesse Larrew <jlarrew@linux.vnet.ibm.com>

powerpc/pseries: Fix typo in VPHN comments

Correct a spelling error in VPHN comments in numa.c.

Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 5d7d8072 11-Jan-2011 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/pseries: Fix build of topology stuff without CONFIG_NUMA

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 39bf990e 17-Dec-2010 Jesse Larrew <jlarrew@linux.vnet.ibm.com>

powerpc/pseries: Fix VPHN build errors on non-SMP systems

The header asm/hvcall.h was previously included indirectly via
smp.h. On non-SMP systems, however, these declarations are excluded
and the build breaks. This is easily fixed by including asm/hvcall.h
directly.

The VPHN feature is only meaningful on NUMA systems that implement
the SPLPAR option, so exclude the VPHN code on systems without
SPLPAR enabled.

Also, expose unmap_cpu_from_node() on systems with SPLPAR enabled,
even if CONFIG_HOTPLUG_CPU is disabled.

Lastly, map_cpu_to_node() is now needed by VPHN to manipulate the
node masks after boot time, so remove the __cpuinit annotation to
fix a section mismatch.

Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com>


# 9eff1a38 30-Nov-2010 Jesse Larrew <jlarrew@linux.vnet.ibm.com>

powerpc/pseries: Poll VPA for topology changes and update NUMA maps

This patch sets a timer during boot that will periodically poll the
associativity change counters in the VPA. When a change in
associativity is detected, it retrieves the new associativity domain
information via the H_HOME_NODE_ASSOCIATIVITY hcall and updates the
NUMA node maps and sysfs entries accordingly. Note that since the
ibm,associativity device tree property does not exist on configurations
with both NUMA and SPLPAR enabled, no device tree updates are necessary.

Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# cd34206e 26-Oct-2010 Nishanth Aravamudan <nacc@us.ibm.com>

powerpc: Add memory_hotplug_max()

Add a function to get the maximum address that can be hotplug added.
This is needed to calculate the size of the tce table needed to cover
all memory in 1:1 mode.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# c7fc2de0 12-Oct-2010 Yinghai Lu <yinghai@kernel.org>

memblock, bootmem: Round pfn properly for memory and reserved regions

We need to round memory regions correctly -- specifically, we need to
round reserved region in the more expansive direction (lower limit
down, upper limit up) whereas usable memory regions need to be rounded
in the more restrictive direction (lower limit up, upper limit down).

This introduces two set of inlines:

memblock_region_memory_base_pfn()
memblock_region_memory_end_pfn()
memblock_region_reserved_base_pfn()
memblock_region_reserved_end_pfn()

Although they are antisymmetric (and therefore are technically
duplicates) the use of the different inlines explicitly documents the
programmer's intention.

The lack of proper rounding caused a bug on ARM, which was then found
to also affect other architectures.

Reported-by: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4CB4CDFD.4020105@kernel.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 28be7072 03-Aug-2010 Benjamin Herrenschmidt <benh@kernel.crashing.org>

memblock/powerpc: Use new accessors

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 3fdfd990 22-Jul-2010 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Fix erroneous lmb->memblock conversions

Oooops... we missed these. We incorrectly converted strings
used when parsing the device-tree on pseries, thus breaking
access to drconf memory and hotplug memory.

While at it, also revert some variable names that represent
something the FW calls "lmb" and thus don't need to be converted
to "memblock".

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---


# 95f72d1e 11-Jul-2010 Yinghai Lu <yinghai@kernel.org>

lmb: rename to memblock

via following scripts

FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

sed -i \
-e 's/lmb/memblock/g' \
-e 's/LMB/MEMBLOCK/g' \
$FILES

for N in $(find . -name lmb.[ch]); do
M=$(echo $N | sed 's/lmb/memblock/g')
mv $N $M
done

and remove some wrong change like lmbench and dlmb etc.

also move memblock.c from lib/ to mm/

Suggested-by: Ingo Molnar <mingo@elte.hu>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 41eab6f8 16-May-2010 Anton Blanchard <anton@samba.org>

powerpc/numa: Use form 1 affinity to setup node distance

Form 1 affinity allows multiple entries in ibm,associativity-reference-points
which represent affinity domains in decreasing order of importance. The
Linux concept of a node is always the first entry, but using the other
values as an input to node_distance() allows the memory allocator to make
better decisions on which node to go first when local memory has been
exhausted.

We keep things simple and create an array indexed by NUMA node, capped at
4 entries. Each time we lookup an associativity property we initialise
the array which is overkill, but since we should only hit this path during
boot it didn't seem worth adding a per node valid bit.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# bc8449cc 16-May-2010 Anton Blanchard <anton@samba.org>

powerpc/numa: Use ibm,architecture-vec-5 to detect form 1 affinity

I've been told that the architected way to determine we are in form 1
affinity mode is by reading the ibm,architecture-vec-5 property which
mirrors the layout of the fifth vector of the ibm,client-architecture
structure.

Eventually we may want to parse the ibm,architecture-vec-5 and create
FW_FEATURE_* bits.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 25863de0 26-Apr-2010 Anton Blanchard <anton@samba.org>

powerpc/cpumask: Convert NUMA code to new cpumask API

Convert NUMA code to new cpumask API. We shift the node to cpumask
setup code until after we complete bootmem allocation so we can
dynamically allocate the cpumasks.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 4b83c330 07-Apr-2010 Anton Blanchard <anton@samba.org>

powerpc/numa: Add form 1 NUMA affinity

Firmware changed the way it represents memory and cpu affinity on POWER7.
Unfortunately the old method now caps the topology to work around issues
with legacy operating systems. For Linux to get the correct topology we
need to use the new form 1 affinity information.

We set the form 1 field in the client architecture, and if we see "1" in the
ibm,associativity-form property firmware supports form 1 affinity and
we should look at the first field in the ibm,associativity-reference-points
array. If not we use the second field as we always have.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 72c33688 05-Mar-2010 H Hartley Sweeten <hartleys@visionengravers.com>

nodemask.h: remove macro any_online_node

The macro any_online_node() is prone to producing sparse warnings due to
the local symbol 'node'. Since all the in-tree users are really
requesting the first online node (the mask argument is either
NODE_MASK_ALL or node_online_map) just use the first_online_node macro and
remove the any_online_node macro since there are no users.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Milton Miller <miltonm@bga.com>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Geoff Levand <geoffrey.levand@am.sony.com>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Benny Halevy <bhalevy@panasas.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d3f6204a 02-Jun-2009 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Set init_bootmem_done on NUMA platforms as well

For some obscure reason, we only set init_bootmem_done after initializing
bootmem when NUMA isn't enabled. We even document this next to the declaration
of that global in system.h which of course I didn't read before I had to
debug why some WIP code wasn't working properly...

This patch changes it so that we always set it after bootmem is initialized
which should have always been the case... go figure !

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 0f16ef7f 17-Feb-2009 Nathan Fontenot <nfont@austin.ibm.com>

powerpc/numa: Cleanup hot_add_scn_to_nid

This patch reworks the hot_add_scn_to_nid and its supporting functions
to make them easier to understand. There are no functional changes in
this patch and has been tested on machine with memory represented in the
device tree as memory nodes and in the ibm,dynamic-memory property.

My previous patch that introduced support for hotplug memory add on
systems whose memory was represented by the ibm,dynamic-memory property
of the device tree only left the code more unintelligible. This
will hopefully makes things easier to understand.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 06eccea6 11-Feb-2009 Dave Hansen <dave@linux.vnet.ibm.com>

powerpc/mm: Fix numa reserve bootmem page selection

Fix the powerpc NUMA reserve bootmem page selection logic.

commit 8f64e1f2d1e09267ac926e15090fd505c1c0cbcb (powerpc: Reserve
in bootmem lmb reserved regions that cross NUMA nodes) changed
the logic for how the powerpc LMB reserved regions were converted
to bootmen reserved regions. As the folowing discussion reports,
the new logic was not correct.

mark_reserved_regions_for_nid() goes through each LMB on the
system that specifies a reserved area. It searches for
active regions that intersect with that LMB and are on the
specified node. It attempts to bootmem-reserve only the area
where the active region and the reserved LMB intersect. We
can not reserve things on other nodes as they may not have
bootmem structures allocated, yet.

We base the size of the bootmem reservation on two possible
things. Normally, we just make the reservation start and
stop exactly at the start and end of the LMB.

However, the LMB reservations are not aware of NUMA nodes and
on occasion a single LMB may cross into several adjacent
active regions. Those may even be on different NUMA nodes
and will require separate calls to the bootmem reserve
functions. So, the bootmem reservation must be trimmed to
fit inside the current active region.

That's all fine and dandy, but we trim the reservation
in a page-aligned fashion. That's bad because we start the
reservation at a non-page-aligned address: physbase.

The reservation may only span 2 bytes, but that those bytes
may span two pfns and cause a reserve_size of 2*PAGE_SIZE.

Take the case where you reserve 0x2 bytes at 0x0fff and
where the active region ends at 0x1000. You'll jump into
that if() statment, but node_ar.end_pfn=0x1 and
start_pfn=0x0. You'll end up with a reserve_size=0x1000,
and then call

reserve_bootmem_node(node, physbase=0xfff, size=0x1000);

0x1000 may not be on the same node as 0xfff. Oops.

In almost all the vm code, end_<anything> is not inclusive.
If you have an end_pfn of 0x1234, page 0x1234 is not
included in the range. Using PFN_UP instead of the
(>> >> PAGE_SHIFT) will make this consistent with the other VM
code.

We also need to do math for the reserved size with physbase
instead of start_pfn. node_ar.end_pfn << PAGE_SHIFT is
*precisely* the end of the node. However,
(start_pfn << PAGE_SHIFT) is *NOT* precisely the beginning
of the reserved area. That is, of course, physbase.
If we don't use physbase here, the reserve_size can be
made too large.

From: Dave Hansen <dave@linux.vnet.ibm.com>
Tested-by: Geoff Levand <geoffrey.levand@am.sony.com> Tested on PS3.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 8b16cd23 07-Jan-2009 Milton Miller <miltonm@bga.com>

powerpc/numa: Remove redundant find_cpu_node()

Use of_get_cpu_node, which is a superset of numa.c's find_cpu_node in
a less restrictive section (text vs cpuinit).

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 20fcefe5 07-Jan-2009 Milton Miller <miltonm@bga.com>

powerpc/numa: Avoid possible reference beyond prop. length in find_min_common_depth()

find_min_common_depth() was checking the property length incorrectly.
The value is in bytes not cells, and it is using the second entry.

Signed-off-By: Milton Miller <miltonm@bga.com>

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 893473df 09-Dec-2008 Dave Hansen <dave@linux.vnet.ibm.com>

powerpc/mm: Cleanup careful_allocation(): consolidate memset()

Both users of careful_allocation() immediately memset() the
result. So, just do it in one place.

Also give careful_allocation() a 'z' prefix to bring it in
line with kzmalloc() and friends.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 0be210fd 09-Dec-2008 Dave Hansen <dave@linux.vnet.ibm.com>

powerpc/mm: Make careful_allocation() return virtual addrs

Since we memset() the result in both of the uses here,
just make careful_alloc() return a virtual address.
Also, add a separate variable to store the physial
address that comes back from the lmb_alloc() functions.
This makes it less likely that someone will screw it up
forgetting to convert before returning since the vaddr
is always in a void* and the paddr is always in an
unsigned long.

I admit this is arbitrary since one of its users needs
a paddr and one a vaddr, but it does remove a good
number of casts.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 5d21ea2b 09-Dec-2008 Dave Hansen <dave@linux.vnet.ibm.com>

powerpc/mm:: Cleanup careful_allocation(): bootmem already panics

If we fail a bootmem allocation, the bootmem code itself
panics. No need to redo it here.

Also change the wording of the other panic. We don't
strictly have to allocate memory on the specified node.
It is just a hint and that node may not even *have* any
memory on it. In that case we can and do fall back to
other nodes.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# c555e520 09-Dec-2008 Dave Hansen <dave@linux.vnet.ibm.com>

powerpc/mm: Add better comment on careful_allocation()

The behavior in careful_allocation() really confused me
at first. Add a comment to hopefully make it easier
on the next doofus that looks at it.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# a4c74ddd 11-Dec-2008 Dave Hansen <dave@linux.vnet.ibm.com>

powerpc: Fix bootmem reservation on uninitialized node

careful_allocation() was calling into the bootmem allocator for
nodes which had not been fully initialized and caused a previous
bug: http://patchwork.ozlabs.org/patch/10528/ So, I merged a
few broken out loops in do_init_bootmem() to fix it. That changed
the code ordering.

I think this bug is triggered by having reserved areas for a node
which are spanned by another node's contents. In the
mark_reserved_regions_for_nid() code, we attempt to reserve the
area for a node before we have allocated the NODE_DATA() for that
nid. We do this since I reordered that loop. I suck.

This is causing crashes at bootup on some systems, as reported
by Jon Tollefson.

This may only present on some systems that have 16GB pages
reserved. But, it can probably happen on any system that is
trying to reserve large swaths of memory that happen to span other
nodes' contents.

This commit ensures that we do not touch bootmem for any node which
has not been initialized, and also removes a compile warning about
an unused variable.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 4a618669 23-Nov-2008 Dave Hansen <dave@linux.vnet.ibm.com>

powerpc: Fix boot freeze on machine with empty memory node

I got a bug report about a distro kernel not booting on a particular
machine. It would freeze during boot:

> ...
> Could not find start_pfn for node 1
> [boot]0015 Setup Done
> Built 2 zonelists in Node order, mobility grouping on. Total pages: 123783
> Policy zone: DMA
> Kernel command line:
> [boot]0020 XICS Init
> [boot]0021 XICS Done
> PID hash table entries: 4096 (order: 12, 32768 bytes)
> clocksource: timebase mult[7d0000] shift[22] registered
> Console: colour dummy device 80x25
> console handover: boot [udbg0] -> real [hvc0]
> Dentry cache hash table entries: 1048576 (order: 7, 8388608 bytes)
> Inode-cache hash table entries: 524288 (order: 6, 4194304 bytes)
> freeing bootmem node 0

I've reproduced this on 2.6.27.7. It is caused by commit
8f64e1f2d1e09267ac926e15090fd505c1c0cbcb ("powerpc: Reserve in bootmem
lmb reserved regions that cross NUMA nodes").

The problem is that Jon took a loop which was (in pseudocode):

for_each_node(nid)
NODE_DATA(nid) = careful_alloc(nid);
setup_bootmem(nid);
reserve_node_bootmem(nid);

and broke it up into:

for_each_node(nid)
NODE_DATA(nid) = careful_alloc(nid);
setup_bootmem(nid);
for_each_node(nid)
reserve_node_bootmem(nid);

The issue comes in when the 'careful_alloc()' is called on a node with
no memory. It falls back to using bootmem from a previously-initialized
node. But, bootmem has not yet been reserved when Jon's patch is
applied. It gives back bogus memory (0xc000000000000000) and pukes
later in boot.

The following patch collapses the loop back together. It also breaks
the mark_reserved_regions_for_nid() code out into a function and adds
some comments. I think a huge part of introducing this bug is because
for loop was too long and hard to read.

The actual bug fix here is the:

+ if (end_pfn <= node->node_start_pfn ||
+ start_pfn >= node_end_pfn)
+ continue;

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# fe55249d 20-Oct-2008 Milton Miller <miltonm@bga.com>

powerpc: Always trim numa memory to lmb_end_of_DRAM()

numa_enforce_memory_limit tried to be smart and only call lmb_end_of_DRAM
when a memory limit was set via mem= on the command line. However,
the early boot code will also limit memory added to the lmb system
when iommu=off is specified. When this happens, the page allocator
is given pages not in the linear mapping and this results in a fatal
data reference to the unmapped page.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# e8170372 16-Oct-2008 Jon Tollefson <kniht@us.ibm.com>

powerpc/numa: Make memory reserve code more robust

Adjust amount to reserve based on previous nodes for reserves spanning
multiple nodes. Check if the node active range is empty before attempting
to pass the reserve to bootmem. In practice the range shouldn't be empty,
but to be sure we check.

Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 8f64e1f2 09-Oct-2008 Jon Tollefson <kniht@linux.vnet.ibm.com>

powerpc: Reserve in bootmem lmb reserved regions that cross NUMA nodes

If there are multiple reserved memory blocks via lmb_reserve() that are
contiguous addresses and on different NUMA nodes we are losing track of which
address ranges to reserve in bootmem on which node. I discovered this
when I recently got to try 16GB huge pages on a system with more then 2 nodes.

When scanning the device tree in early boot we call lmb_reserve() with
the addresses of the 16G pages that we find so that the memory doesn't
get used for something else. For example the addresses for the pages
could be 4000000000, 4400000000, 4800000000, 4C00000000, etc - 8 pages,
one on each of eight nodes. In the lmb after all the pages have been
reserved it will look something like the following:

lmb_dump_all:
memory.cnt = 0x2
memory.size = 0x3e80000000
memory.region[0x0].base = 0x0
.size = 0x1e80000000
memory.region[0x1].base = 0x4000000000
.size = 0x2000000000
reserved.cnt = 0x5
reserved.size = 0x3e80000000
reserved.region[0x0].base = 0x0
.size = 0x7b5000
reserved.region[0x1].base = 0x2a00000
.size = 0x78c000
reserved.region[0x2].base = 0x328c000
.size = 0x43000
reserved.region[0x3].base = 0xf4e8000
.size = 0xb18000
reserved.region[0x4].base = 0x4000000000
.size = 0x2000000000

The reserved.region[0x4] contains the 16G pages. In
arch/powerpc/mm/num.c: do_init_bootmem() we loop through each of the
node numbers looking for the reserved regions that belong to the
particular node. It is not able to identify region 0x4 as being a part
of each of the 8 nodes. It is assuming that a reserved region is only
on a single node.

This patch takes out the reserved region loop from inside
the loop that goes over each node. It looks up the active region containing
the start of the reserved region. If it extends past that active region then
it adjusts the size and gets the next active region containing it.

Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# cf00085d 29-Aug-2008 Chandru <chandru@in.ibm.com>

powerpc: Add support for dynamic reconfiguration memory in kexec/kdump kernels

Kdump kernel needs to use only those memory regions that it is allowed
to use (crashkernel, rtas, tce, etc.). Each of these regions have
their own sizes and are currently added under 'linux,usable-memory'
property under each memory@xxx node of the device tree.

The ibm,dynamic-memory property of ibm,dynamic-reconfiguration-memory
node (on POWER6) now stores in it the representation for most of the
logical memory blocks with the size of each memory block being a
constant (lmb_size). If one or more or part of the above mentioned
regions lie under one of the lmb from ibm,dynamic-memory property,
there is a need to identify those regions within the given lmb.

This makes the kernel recognize a new 'linux,drconf-usable-memory'
property added by kexec-tools. Each entry in this property is of the
form of a count followed by that many (base, size) pairs for the above
mentioned regions. The number of cells in the count value is given by
the #size-cells property of the root node.

Signed-off-by: Chandru Siddalingappa <chandru@in.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# b61bfa3c 23-Jul-2008 Johannes Weiner <hannes@saeurebad.de>

mm: move bootmem descriptors definition to a single place

There are a lot of places that define either a single bootmem descriptor or an
array of them. Use only one central array with MAX_NUMNODES items instead.

Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0db9360a 02-Jul-2008 Nathan Fontenot <nfont@austin.ibm.com>

powerpc/pseries: Update numa association of hotplug memory add for drconf memory

Update the association of a memory section with a numa node that
occurs during hotplug add of a memory section. This adds a check in
the hot_add_scn_to_nid() routine for the
ibm,dynamic-reconfiguration-memory node in the device tree. If
present the new hot_add_drconf_scn_to_nid() routine is invoked, which
can properly parse the ibm,dynamic-reconfiguration-memory node of the
device tree and make the proper numa node associations.

This also introduces the valid_hot_add_scn() routine as a helper
function for code that is common to the hot_add_scn_to_nid() and
hot_add_drconf_scn_to_nid() routines.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 8342681d 02-Jul-2008 Nathan Fontenot <nfont@austin.ibm.com>

powerpc/pseries: Split code into helper routines for drconf memory

This splits off several pieces of code that parse the
ibm,dynamic-reconfiguration-memory node of the device tree into separate
helper routines. This is in preparation for the next commit that will
use these helper routines. There are no functional changes in this patch.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 6df1646e 13-Feb-2008 Michael Ellerman <michael@ellerman.id.au>

[POWERPC] Add include of linux/of.h to numa.c

numa.c requires routines declared in linux/of.h, so should include it.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# d9b2b2a2 13-Feb-2008 David S. Miller <davem@davemloft.net>

[LIB]: Make PowerPC LMB code generic so sparc64 can use it too.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 72a7fe39 07-Feb-2008 Bernhard Walle <bwalle@suse.de>

Introduce flags for reserve_bootmem()

This patchset adds a flags variable to reserve_bootmem() and uses the
BOOTMEM_EXCLUSIVE flag in crashkernel reservation code to detect collisions
between crashkernel area and already used memory.

This patch:

Change the reserve_bootmem() function to accept a new flag BOOTMEM_EXCLUSIVE.
If that flag is set, the function returns with -EBUSY if the memory already
has been reserved in the past. This is to avoid conflicts.

Because that code runs before SMP initialisation, there's no race condition
inside reserve_bootmem_core().

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix powerpc build]
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Cc: <linux-arch@vger.kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1daa6d08 31-Jan-2008 Balbir Singh <balbir@linux.vnet.ibm.com>

[POWERPC] Fake NUMA emulation for PowerPC

Here's a dumb simple implementation of fake NUMA nodes for PowerPC.
Fake NUMA nodes can be specified using the following command line
option

numa=fake=<node range>

node range is of the format <range1>,<range2>,...<rangeN>

Each of the rangeX parameters is passed using memparse(). I find the
patch useful for fake NUMA emulation on my simple PowerPC machine.
I've tested it on a numa box with the following arguments

numa=fake=512M
numa=fake=512M,768M
numa=fake=256M,512M mem=512M
numa=fake=1G mem=768M
numa=fake=
without any numa= argument

The other side-effect introduced by this patch is that; in the case
where we don't have NUMA information, we now set a node online after
adding each LMB. This node could very well be node 0, but in the case
that we enable fake NUMA nodes, when we cross node boundaries, we need
to set the new node online.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 55852bed 25-Jan-2008 Paul Mackerras <paulus@samba.org>

Revert "[POWERPC] Fake NUMA emulation for PowerPC"

This reverts commit 5c3f5892a2db6757a72ce8b27cba90db06683e1d,
basically because it changes behaviour even when no fake NUMA
information is specified on the kernel command line.

Firstly, it changes the nid, thus destroying the real NUMA
information. Secondly, it also changes behaviour in that if a node
ends up with no memory in it because of the memory limit, we used to
set it online and now we don't.

Also, in the non-NUMA case with no fake NUMA information, we do
node_set_online once for each LMB now, whereas previously we only did
it once. I don't know if that is actually a problem, but it does seem
unnecessary.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# 5c3f5892 07-Dec-2007 Balbir Singh <balbir@linux.vnet.ibm.com>

[POWERPC] Fake NUMA emulation for PowerPC

Here's a dumb simple implementation of fake NUMA nodes for PowerPC.
Fake NUMA nodes can be specified using the following command line option

numa=fake=<node range>

node range is of the format <range1>,<range2>,...<rangeN>

Each of the rangeX parameters is passed using memparse(). I find this
useful for fake NUMA emulation on my simple PowerPC machine. I've
tested it on a non-numa box with the following arguments:

numa=fake=1G
numa=fake=1G,2G
name=fake=1G,512M,2G
numa=fake=1500M,2800M mem=3500M
numa=fake=1G mem=512M
numa=fake=1G mem=1G

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# b9c3fdb0 31-Jul-2007 Michael Ellerman <michael@ellerman.id.au>

[POWERPC] Fix parse_drconf_memory() for 64-bit start addresses

Some new machines use the "ibm,dynamic-reconfiguration-memory" property
to provide memory layout information, rather than via memory nodes.

There is a bug in the code to parse this property for start addresses
over 4GB; we store the start address in an unsigned int, which means
we throw away the high bits and add apparently duplicate regions.
This results in a BUG() in free_bootmem_core(). This fixes it by
using an unsigned long instead.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 8bb78442 09-May-2007 Rafael J. Wysocki <rjw@rjwysocki.net>

Add suspend-related notifications for CPU hotplug

Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress. This
patch introduces such notifications and causes them to be used during
suspend and resume transitions. It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).

[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e2eb6392 03-Apr-2007 Stephen Rothwell <sfr@canb.auug.org.au>

[POWERPC] Rename get_property to of_get_property: arch/powerpc

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 9213feea 02-Apr-2007 Stephen Rothwell <sfr@canb.auug.org.au>

[POWERPC] Rename prom_n_size_cells to of_n_size_cells

This is more consistent and gets us closer to the Sparc code.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# a8bda5dd 02-Apr-2007 Stephen Rothwell <sfr@canb.auug.org.au>

[POWERPC] Rename prom_n_addr_cells to of_n_addr_cells

This is more consistent and gets us closer to the Sparc code.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 1b3c3714 17-Feb-2007 Uwe Kleine-König <zeisberg@informatik.uni-freiburg.de>

Fix typos concerning hierarchy

heirarchical, hierachical -> hierarchical
heirarchy, hierachy -> hierarchy

Signed-off-by: Uwe Kleine-König <zeisberg@informatik.uni-freiburg.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# 0204568a 29-Nov-2006 Paul Mackerras <paulus@samba.org>

[POWERPC] Support ibm,dynamic-reconfiguration-memory nodes

For PAPR partitions with large amounts of memory, the firmware has an
alternative, more compact representation for the information about the
memory in the partition and its NUMA associativity information. This
adds the code to the kernel to parse this alternative representation.

The other part of this patch is telling the firmware that we can
handle the alternative representation. There is however a subtlety
here, because the firmware will invoke a reboot if the memory
representation we request is different from the representation that
firmware is currently using. This is because firmware can't change
the representation on the fly. Further, some firmware versions used
on POWER5+ machines have a bug where this reboot leaves the machine
with an altered value of load-base, which will prevent any kernel
booting until it is reset to the normal value (0x4000). Because of
this bug, we do NOT set fake_elf.rpanote.new_mem_def = 1, and thus we
do not request the new representation on POWER5+ and earlier machines.
We do request the new representation on POWER6, which uses the
ibm,client-architecture-support call.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# 6391af17 11-Oct-2006 Mel Gorman <mel@csn.ul.ie>

[PATCH] mm: use symbolic names instead of indices for zone initialisation

Arch-independent zone-sizing is using indices instead of symbolic names to
offset within an array related to zones (max_zone_pfns). The unintended
impact is that ZONE_DMA and ZONE_NORMAL is initialised on powerpc instead
of ZONE_DMA and ZONE_HIGHMEM when CONFIG_HIGHMEM is set. As a result, the
the machine fails to boot but will boot with CONFIG_HIGHMEM turned off.

The following patch properly initialises the max_zone_pfns[] array and uses
symbolic names instead of indices in each architecture using
arch-independent zone-sizing. Two users have successfully booted their
powerpcs with it (one an ibook G4). It has also been boot tested on x86,
x86_64, ppc64 and ia64. Please merge for 2.6.19-rc2.

Credit to Benjamin Herrenschmidt for identifying the bug and rolling the
first fix. Additional credit to Johannes Berg and Andreas Schwab for
reporting the problem and testing on powerpc.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c67c3cb4 27-Sep-2006 Mel Gorman <mel@csn.ul.ie>

[PATCH] Have Power use add_active_range() and free_area_init_nodes()

Size zones and holes in an architecture independent manner for Power.

[judith@osdl.org: build fix]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a7f67bdf 11-Jul-2006 Jeremy Kerr <jk@ozlabs.org>

[POWERPC] Constify & voidify get_property()

Now that get_property() returns a void *, there's no need to cast its
return value. Also, treat the return value as const, so we can
constify get_property later.

powerpc core changes.

Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 74b85f37 27-Jun-2006 Chandra Seetharaman <sekharan@us.ibm.com>

[PATCH] cpu hotplug: make cpu_notifier related notifier blocks __cpuinit only

Make notifier_blocks associated with cpu_notifier as __cpuinitdata.

__cpuinitdata makes sure that the data is init time only unless
CONFIG_HOTPLUG_CPU is defined.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 953039c8 01-May-2006 Jeremy Kerr <jk@ozlabs.org>

[PATCH] powerpc: Allow devices to register with numa topology

Change of_node_to_nid() to traverse the device tree, looking for a numa id.
Cell uses this to assign ids to SPUs, which are children of the CPU node.
Existing users of of_node_to_nid() are altered to use of_node_to_nid_single(),
which doesn't do the traversal.

Export an attach_sysdev_to_node() function, allowing system devices (eg.
SPUs) to link themselves into the numa topology in sysfs.

Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e110b281 12-Apr-2006 Olof Johansson <olof@lixom.net>

[PATCH] powerpc: Less verbose mem configuration output

Quieten some of the debug ram config output. we already print out available
memory at KERN_INFO level.

Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 069007ae 24-Mar-2006 Andrew Morton <akpm@osdl.org>

[PATCH] powerpc: hot_add_scn_to_nid() build fix

The return statement is to prevent `warning: 'nid' might be used uninitialized
in this function'.

Cc: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 2b261227 20-Mar-2006 Nathan Lynch <nathanl@austin.ibm.com>

[PATCH] powerpc numa: Consolidate assignment of cpus to nodes

We can plug the boot cpu into its node independently of whether numa
topology is detected. And numa_setup_cpu does the right thing for all
cases now, so remove special-casing for non-numa from the cpu hotplug
callback.

Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 482ec7c4 20-Mar-2006 Nathan Lynch <nathanl@austin.ibm.com>

[PATCH] powerpc numa: Support sparse online node map

The powerpc numa code unconditionally onlines all nodes from 0 to the
highest node id found, regardless of whether cpus or memory are
present in the nodes. This wastes 8K per node and complicates some
cpu and memory hotplug situations, such as adding a resource that
doesn't map to one of the nodes discovered at boot.

Set nodes online as resources are scanned. Fall back to node 0 only
when we're sure this isn't a NUMA machine.

Instead of defaulting to node 0 for cases of hot-adding a resource
which doesn't belong to any initialized node, assign it to the first
online node.

Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# bc16a759 20-Mar-2006 Nathan Lynch <nathanl@austin.ibm.com>

[PATCH] powerpc numa: Consolidate handling of Power4 special case

Code to handle Power4's invalid node id (0xffff) is duplicated for cpu
and memory. Better to handle this case in one place --
of_node_to_nid. Overall behavior should be unchanged.

Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# cf950b7a 20-Mar-2006 Nathan Lynch <nathanl@austin.ibm.com>

[PATCH] powerpc numa: Get rid of "numa domain" terminology

Since we effectively treat the domain ids given to us by firmare as
logical node ids, make this explicit (basically s/numa_domain/nid/).

No functional changes, only variable and function names are modified.

Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 2e5ce39d 20-Mar-2006 Nathan Lynch <nathanl@austin.ibm.com>

[PATCH] powerpc numa: Minor cpu hotplug-related cleanups

map_cpu_to_node does not need to be inline, it is never called in a
hot path.

map_cpu_to_node, numa_setup_cpu, and find_cpu_node can be marked
__cpuinit, as they are never used after boot if CONFIG_HOTPLUG_CPU=n.

Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# bf4b85b0 20-Mar-2006 Nathan Lynch <nathanl@austin.ibm.com>

[PATCH] powerpc numa: Minor debugging code changes

Add debug statement for map_cpu_to_node; it's useful for cpu hotplug.

Clarify debug statement about not finding the numa reference points
property.

Don't print a meaningless associativity depth (-1) on non-numa systems.

Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# c08888cf 20-Mar-2006 Nathan Lynch <nathanl@austin.ibm.com>

[PATCH] powerpc numa: fix boot_cpuid always assigned to node 0

At boot, the numa code is assigning boot_cpuid to node 0
unconditionally. Basically, numa_setup_cpu is being stupid about it,
but this is the minimal fix -- just call numa_setup_cpu(boot_cpuid)
later, after all nodes have been set online.

Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# d7a5b2ff 25-Jan-2006 Michael Ellerman <michael@ellerman.id.au>

[PATCH] powerpc: Always panic if lmb_alloc() fails

Currently most callers of lmb_alloc() don't check if it worked or not, if it
ever does weird bad things will probably happen. The few callers who do check
just panic or BUG_ON.

So make lmb_alloc() panic internally, to catch bugs at the source. The few
callers who did check the result no longer need to.

The only caller that did anything interesting with the return result was
careful_allocation(). For it we create __lmb_alloc_base() which _doesn't_ panic
automatically, a little messy, but passable.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# b226e462 16-Dec-2005 Mike Kravetz <kravetz@us.ibm.com>

[PATCH] powerpc: don't add memory to empty node/zone

The system will oops if an attempt is made to add memory to an
empty node/zone. This patch prevents adding memory to an empty
node. The code to dynamically add a node/zone is non-trivial.
This patch is temporary and will be removed when the ability
to dynamically add a node/zone is complete.

Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# cc5d0189 13-Dec-2005 Benjamin Herrenschmidt <benh@kernel.crashing.org>

[PATCH] powerpc: Remove device_node addrs/n_addr

The pre-parsed addrs/n_addrs fields in struct device_node are finally
gone. Remove the dodgy heuristics that did that parsing at boot and
remove the fields themselves since we now have a good replacement with
the new OF parsing code. This patch also fixes a bunch of drivers to use
the new code instead, so that at least pmac32, pseries, iseries and g5
defconfigs build.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 4b703a23 12-Dec-2005 Anton Blanchard <anton@samba.org>

[PATCH] ppc64: Add NUMA cpu summary at boot

We used to print a NUMA cpu summary at boot before the hotplug cpu code
was added. This has been useful for catching machine configuration as
well as firmware bugs in the past.

This patch restores that functionality. An example of the output is:

Node 0 CPUs: 0-7
Node 1 CPUs: 8-15

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# ba759485 04-Dec-2005 Michael Ellerman <michael@ellerman.id.au>

[PATCH] powerpc: Add support for "linux,usable-memory" on memory nodes

Milton has proposed that we should support a "linux,usable-memory" property
on memory nodes which describes, in preference to "reg", the regions of memory
Linux should use.

This facility is required for kdump to inform the second kernel which memory
it should use.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 237a0989 05-Dec-2005 Mike Kravetz <kravetz@us.ibm.com>

[PATCH] powerpc: numa placement for dynamically added memory

This places dynamically added memory within the appropriate
numa node. A new routine hot_add_scn_to_nid() replicates most of
the memory scanning code in parse_numa_properties().

Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 84c9fdd1 30-Nov-2005 Mike Kravetz <kravetz@us.ibm.com>

[PATCH] powerpc: Minor numa memory code cleanup

Here is an updated version of the patch that panics if no memory is
found as Nathan suggested. I'm still concerned that panic strings
(not just the one added here) at this stage of booting do not show
up on my system. But, that is an issue separate from this patch.

Combine get_mem_*_cells() routines to avoid multiple memory node
lookups. Added missing of_node_put() call. Changed variable names
to help with some confusion as to meaning.

Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 54c23310 04-Dec-2005 Paul Mackerras <paulus@samba.org>

Revert "[PATCH] powerpc: Minor numa memory code cleanup"

This reverts f1fdc0117004d343698b9830e141491d5ae320d1 commit.


# 74761bb5 28-Nov-2005 Mike Kravetz <kravetz@us.ibm.com>

[PATCH] powerpc: Minor numa memory code cleanup

I started to add missing of_node_put() calls to the routines that
determine the number of cells for memory. Decided to combine the
routines instead of making separate node lookups. Changed variable
names to help with some confusion as to meaning.

Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 6d91bb93 07-Dec-2005 Mike Kravetz <kravetz@us.ibm.com>

[PATCH] powerpc/pseries: boot failures on numa if no memory on node

This bug exists in the current code and prevents machines from booting
with numa enabled if there is a node that does not contain memory.
Workaround is to boot with 'numa=off'. Looks like a simple typo.

Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# fb6d73d3 15-Nov-2005 Paul Mackerras <paulus@samba.org>

[PATCH] powerpc: Fix sparsemem with memory holes [was Re: ppc64 oops..]

This patch should fix the crashes we have been seeing on 64-bit
powerpc systems with a memory hole when sparsemem is enabled.
I'd appreciate it if people who know more about NUMA and sparsemem
than me could look over it.

There were two bugs. The first was that if NUMA was enabled but there
was no NUMA information for the machine, the setup_nonnuma() function
was adding a single region, assuming memory was contiguous. The
second was that the loops in mem_init() and show_mem() assumed that
all pages within the span of a pgdat were valid (had a valid struct
page).

I also fixed the incorrect setting of num_physpages that Mike Kravetz
pointed out.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 45fb6cea 10-Nov-2005 Anton Blanchard <anton@samba.org>

[PATCH] ppc64: Convert NUMA to sparsemem (3)

Convert to sparsemem and remove all the discontigmem code in the
process. This has a few advantages:

- The old numa_memory_lookup_table can go away
- All the arch specific discontigmem magic can go away

We also remove the triple pass of memory properties and instead create a
list of per node extents that we iterate through. A final cleanup would
be to change our lmb code to store extents per node, then we can reuse
that information in the numa code.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 3e66c4de 10-Nov-2005 Anton Blanchard <anton@samba.org>

[PATCH] ppc64: prep for NUMA sparsemem rework 2

Remove ppc64 specific version of nr_cpus_node and use the generic one
provided.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 2249ca9d 06-Nov-2005 Paul Mackerras <paulus@samba.org>

powerpc: Various UP build fixes

Mostly this involves adding #include <asm/smp.h>, since that defines
things like boot_cpuid[_phys] and [gs]et_hard_smp_processor_id, which
are SMP-related but still needed on UP. This incorporates fixes
posted by Olof Johansson and Heikki Lindholm.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# cf00a8d1 30-Oct-2005 Paul Mackerras <paulus@samba.org>

powerpc: Fix bug arising from having multiple memory_limit variables

We had a static memory_limit in prom.c, and then another one defined
in setup_64.c and used in numa.c, which resulted in the kernel crashing
when mem=xxx was given on the command line. This puts the declaration
in system.h and the definition in mem.c. This also moves the
definition of tce_alloc_start/end out of setup_64.c.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# ab1f9dac 10-Oct-2005 Paul Mackerras <paulus@samba.org>

powerpc: Merge arch/ppc64/mm to arch/powerpc/mm

This moves the remaining files in arch/ppc64/mm to arch/powerpc/mm,
and arranges that we use them when compiling with ARCH=ppc64.

Signed-off-by: Paul Mackerras <paulus@samba.org>