History log of /linux-master/arch/powerpc/include/asm/book3s/64/pgalloc.h
Revision Date Author Comments
# a5edf981 26-Sep-2022 Nicholas Miehlbradt <nicholas@linux.ibm.com>

powerpc/64s: Enable KFENCE on book3s64

KFENCE support was added for ppc32 in commit 90cbac0e995d
("powerpc: Enable KFENCE for PPC32").
Enable KFENCE on ppc64 architecture with hash and radix MMUs.
It uses the same mechanism as debug pagealloc to
protect/unprotect pages. All KFENCE kunit tests pass on both
MMUs.

KFENCE memory is initially allocated using memblock but is
later marked as SLAB allocated. This necessitates the change
to __pud_free to ensure that the KFENCE pages are freed
appropriately.

Based on previous work by Christophe Leroy and Jordan Niethe.

Signed-off-by: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Reviewed-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926075726.2846-4-nicholas@linux.ibm.com


# 645d5ce2 09-Jul-2020 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings

We can hit the following BUG_ON during memory unplug:

kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:342!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
NIP [c000000000093308] pmd_fragment_free+0x48/0xc0
LR [c00000000147bfec] remove_pagetable+0x578/0x60c
Call Trace:
0xc000008050000000 (unreliable)
remove_pagetable+0x384/0x60c
radix__remove_section_mapping+0x18/0x2c
remove_section_mapping+0x1c/0x3c
arch_remove_memory+0x11c/0x180
try_remove_memory+0x120/0x1b0
__remove_memory+0x20/0x40
dlpar_remove_lmb+0xc0/0x114
dlpar_memory+0x8b0/0xb20
handle_dlpar_errorlog+0xc0/0x190
pseries_hp_work_fn+0x2c/0x60
process_one_work+0x30c/0x810
worker_thread+0x98/0x540
kthread+0x1c4/0x1d0
ret_from_kernel_thread+0x5c/0x74

This occurs when unplug is attempted for such memory which has
been mapped using memblock pages as part of early kernel page
table setup. We wouldn't have initialized the PMD or PTE fragment
count for those PMD or PTE pages.

This can be fixed by allocating memory in PAGE_SIZE granularity
during early page table allocation. This makes sure a specific
page is not shared for another memblock allocation and we can
free them correctly on removing page-table pages.

Since we now do PAGE_SIZE allocations for both PUD table and
PMD table (Note that PTE table allocation is already of PAGE_SIZE),
we end up allocating more memory for the same amount of system RAM.
Here is a comparision of how much more we need for a 64T and 2G
system after this patch:

1. 64T system
-------------
64T RAM would need 64G for vmemmap with struct page size being 64B.

128 PUD tables for 64T memory (1G mappings)
1 PUD table and 64 PMD tables for 64G vmemmap (2M mappings)

With default PUD[PMD]_TABLE_SIZE(4K), (128+1+64)*4K=772K
With PAGE_SIZE(64K) table allocations, (128+1+64)*64K=12352K

2. 2G system
------------
2G RAM would need 2M for vmemmap with struct page size being 64B.

1 PUD table for 2G memory (1G mapping)
1 PUD table and 1 PMD table for 2M vmemmap (2M mappings)

With default PUD[PMD]_TABLE_SIZE(4K), (1+1+1)*4K=12K
With new PAGE_SIZE(64K) table allocations, (1+1+1)*64K=192K

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709131925.922266-2-aneesh.kumar@linux.ibm.com


# 2fb47060 04-Jun-2020 Mike Rapoport <rppt@kernel.org>

powerpc: add support for folded p4d page tables

Implement primitives necessary for the 4th level folding, add walks of p4d
level where appropriate and replace 5level-fixup.h with pgtable-nop4d.h.

[rppt@linux.ibm.com: powerpc/xmon: drop unused pgdir varialble in show_pte() function]
Link: http://lkml.kernel.org/r/20200519181454.GI1059226@linux.ibm.com
[rppt@linux.ibm.com; build fix]
Link: http://lkml.kernel.org/r/20200423141845.GI13521@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr> # 8xx and 83xx
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: James Morse <james.morse@arm.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200414153455.21744-9-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 12e4d53f 03-Feb-2020 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

Patch series "Fixup page directory freeing", v4.

This is a repost of patch series from Peter with the arch specific changes
except ppc64 dropped. ppc64 changes are added here because we are redoing
the patch series on top of ppc64 changes. This makes it easy to backport
these changes. Only the first 2 patches need to be backported to stable.

The thing is, on anything SMP, freeing page directories should observe the
exact same order as normal page freeing:

1) unhook page/directory
2) TLB invalidate
3) free page/directory

Without this, any concurrent page-table walk could end up with a
Use-after-Free. This is esp. trivial for anything that has software
page-table walkers (HAVE_FAST_GUP / software TLB fill) or the hardware
caches partial page-walks (ie. caches page directories).

Even on UP this might give issues since mmu_gather is preemptible these
days. An interrupt or preempted task accessing user pages might stumble
into the free page if the hardware caches page directories.

This patch series fixes ppc64 and add generic MMU_GATHER changes to
support the conversion of other architectures. I haven't added patches
w.r.t other architecture because they are yet to be acked.

This patch (of 9):

A followup patch is going to make sure we correctly invalidate page walk
cache before we free page table pages. In order to keep things simple
enable RCU_TABLE_FREE even for !SMP so that we don't have to fixup the
!SMP case differently in the followup patch

!SMP case is right now broken for radix translation w.r.t page walk
cache flush. We can get interrupted in between page table free and
that would imply we have page walk cache entries pointing to tables
which got freed already. Michael said "both our platforms that run on
Power9 force SMP on in Kconfig, so the !SMP case is unlikely to be a
problem for anyone in practice, unless they've hacked their kernel to
build it !SMP."

Link: http://lkml.kernel.org/r/20200116064531.483522-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: <stable@vger.kernel.org>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 52162ec7 24-Oct-2019 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/mm/book3s64/radix: Use freed_tables instead of need_flush_all

With commit 22a61c3c4f13 ("asm-generic/tlb: Track freeing of
page-table directories in struct mmu_gather") we now track whether we
freed page table in mmu_gather. Use that to decide whether to flush
Page Walk Cache.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191024075801.22434-2-aneesh.kumar@linux.ibm.com


# 2874c5fd 27-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# f3935626 05-May-2019 Rick Lindsley <ricklind@linux.vnet.ibm.com>

powerpc/book3s/64: check for NULL pointer in pgd_alloc()

When the memset code was added to pgd_alloc(), it failed to consider
that kmem_cache_alloc() can return NULL. It's uncommon, but not
impossible under heavy memory contention. Example oops:

Unable to handle kernel paging request for data at address 0x00000000
Faulting instruction address: 0xc0000000000a4000
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
CPU: 70 PID: 48471 Comm: entrypoint.sh Kdump: loaded Not tainted 4.14.0-115.6.1.el7a.ppc64le #1
task: c000000334a00000 task.stack: c000000331c00000
NIP: c0000000000a4000 LR: c00000000012f43c CTR: 0000000000000020
REGS: c000000331c039c0 TRAP: 0300 Not tainted (4.14.0-115.6.1.el7a.ppc64le)
MSR: 800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 44022840 XER: 20040000
CFAR: c000000000008874 DAR: 0000000000000000 DSISR: 42000000 SOFTE: 1
...
NIP [c0000000000a4000] memset+0x68/0x104
LR [c00000000012f43c] mm_init+0x27c/0x2f0
Call Trace:
mm_init+0x260/0x2f0 (unreliable)
copy_mm+0x11c/0x638
copy_process.isra.28.part.29+0x6fc/0x1080
_do_fork+0xdc/0x4c0
ppc_clone+0x8/0xc
Instruction dump:
409e000c b0860000 38c60002 409d000c 90860000 38c60004 78a0d183 78a506a0
7c0903a6 41820034 60000000 60420000 <f8860000> f8860008 f8860010 f8860018

Fixes: fc5c2f4a55a2 ("powerpc/mm/hash64: Zero PGD pages on allocation")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Rick Lindsley <ricklind@vnet.linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 8a2cc87a 26-Apr-2019 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc/mm: refactor pmd_pgtable()

pmd_pgtable() is identical on the 4 subarches, refactor it.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# e80789a3 26-Apr-2019 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc/mm: refactor definition of pgtable_cache[]

pgtable_cache[] is the same for the 4 subarches, lets make it common.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# dc096864 26-Apr-2019 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc/mm: refactor pte_alloc_one() and pte_free() families definition.

Functions pte_alloc_one(), pte_alloc_one_kernel(), pte_free(),
pte_free_kernel() are identical for the four subarches.

This patch moves their definition in a common place.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# c746ca00 13-Feb-2019 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/book3s: Remove pgd/pud/pmd_set() interfaces

When updating page tables, we need to make sure we fill the page table
entry valid bits. We do this by or'ing in one of PGD/PUD/PMD_VAL_BITS.

The page table 'set' interfaces allow updating the raw value of page
table entries without setting the valid bits, so remove those
interfaces to avoid incorrect usage in future.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Reword commit message based on mailing list discussion]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 4cf58924 03-Jan-2019 Joel Fernandes (Google) <joel@joelfernandes.org>

mm: treewide: remove unused address argument from pte_alloc functions

Patch series "Add support for fast mremap".

This series speeds up the mremap(2) syscall by copying page tables at
the PMD level even for non-THP systems. There is concern that the extra
'address' argument that mremap passes to pte_alloc may do something
subtle architecture related in the future that may make the scheme not
work. Also we find that there is no point in passing the 'address' to
pte_alloc since its unused. This patch therefore removes this argument
tree-wide resulting in a nice negative diff as well. Also ensuring
along the way that the enabled architectures do not do anything funky
with the 'address' argument that goes unnoticed by the optimization.

Build and boot tested on x86-64. Build tested on arm64. The config
enablement patch for arm64 will be posted in the future after more
testing.

The changes were obtained by applying the following Coccinelle script.
(thanks Julia for answering all Coccinelle questions!).
Following fix ups were done manually:
* Removal of address argument from pte_fragment_alloc
* Removal of pte_alloc_one_fast definitions from m68k and microblaze.

// Options: --include-headers --no-includes
// Note: I split the 'identifier fn' line, so if you are manually
// running it, please unsplit it so it runs for you.

virtual patch

@pte_alloc_func_def depends on patch exists@
identifier E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
type T2;
@@

fn(...
- , T2 E2
)
{ ... }

@pte_alloc_func_proto_noarg depends on patch exists@
type T1, T2, T3, T4;
identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

(
- T3 fn(T1, T2);
+ T3 fn(T1);
|
- T3 fn(T1, T2, T4);
+ T3 fn(T1, T2);
)

@pte_alloc_func_proto depends on patch exists@
identifier E1, E2, E4;
type T1, T2, T3, T4;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

(
- T3 fn(T1 E1, T2 E2);
+ T3 fn(T1 E1);
|
- T3 fn(T1 E1, T2 E2, T4 E4);
+ T3 fn(T1 E1, T2 E2);
)

@pte_alloc_func_call depends on patch exists@
expression E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

fn(...
-, E2
)

@pte_alloc_macro depends on patch exists@
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
identifier a, b, c;
expression e;
position p;
@@

(
- #define fn(a, b, c) e
+ #define fn(a, b) e
|
- #define fn(a, b) e
+ #define fn(a) e
)

Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 129dd323 29-Nov-2018 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc/mm: enable the use of page table cache of order 0

hugepages uses a cache of order 0. Lets allow page tables
of order 0 in the common part in order to avoid open coding
in hugetlb

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# a95d133c 29-Nov-2018 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc/mm: Move pte_fragment_alloc() to a common location

In preparation of next patch which generalises the use of
pte_fragment_alloc() for all, this patch moves the related functions
in a place that is common to all subarches.

The 8xx will need that for supporting 16k pages, as in that mode
page tables still have a size of 4k.

Since pte_fragment with only once fragment is not different
from what is done in the general case, we can easily migrate all
subarchs to pte fragments.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# a2dc009a 12-Aug-2018 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/mm/book3s/radix: Add mapping statistics

Add statistics that show how memory is mapped within the kernel linear mapping.
This is similar to commit 37cd944c8d8f ("s390/pgtable: add mapping statistics")

We don't do this with Hash translation mode. Hash uses one size (mmu_linear_psize)
to map the kernel linear mapping and we print the linear psize during boot as
below.

"Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24"

A sample output looks like:

DirectMap4k: 0 kB
DirectMap64k: 18432 kB
DirectMap2M: 1030144 kB
DirectMap1G: 11534336 kB

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# a984506c 19-Jul-2018 Michael Ellerman <mpe@ellerman.id.au>

powerpc/mm: Don't report PUDs as memory leaks when using kmemleak

Paul Menzel reported that kmemleak was producing reports such as:

unreferenced object 0xc0000000f8b80000 (size 16384):
comm "init", pid 1, jiffies 4294937416 (age 312.240s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000d997deb7>] __pud_alloc+0x80/0x190
[<0000000087f2e8a3>] move_page_tables+0xbac/0xdc0
[<00000000091e51c2>] shift_arg_pages+0xc0/0x210
[<00000000ab88670c>] setup_arg_pages+0x22c/0x2a0
[<0000000060871529>] load_elf_binary+0x41c/0x1648
[<00000000ecd9d2d4>] search_binary_handler.part.11+0xbc/0x280
[<0000000034e0cdd7>] __do_execve_file.isra.13+0x73c/0x940
[<000000005f953a6e>] sys_execve+0x58/0x70
[<000000009700a858>] system_call+0x5c/0x70

Indicating that a PUD was being leaked.

However what's really happening is that kmemleak is not able to
recognise the references from the PGD to the PUD, because they are not
fully qualified pointers.

We can confirm that in xmon, eg:

Find the task struct for pid 1 "init":
0:mon> P
task_struct ->thread.ksp PID PPID S P CMD
c0000001fe7c0000 c0000001fe803960 1 0 S 13 systemd

Dump virtual address 0 to find the PGD:
0:mon> dv 0 c0000001fe7c0000
pgd @ 0xc0000000f8b01000

Dump the memory of the PGD:
0:mon> d c0000000f8b01000
c0000000f8b01000 00000000f8b90000 0000000000000000 |................|
c0000000f8b01010 0000000000000000 0000000000000000 |................|
c0000000f8b01020 0000000000000000 0000000000000000 |................|
c0000000f8b01030 0000000000000000 00000000f8b80000 |................|
^^^^^^^^^^^^^^^^

There we can see the reference to our supposedly leaked PUD. But
because it's missing the leading 0xc, kmemleak won't recognise it.

We can confirm it's still in use by translating an address that is
mapped via it:
0:mon> dv 7fff94000000 c0000001fe7c0000
pgd @ 0xc0000000f8b01000
pgdp @ 0xc0000000f8b01038 = 0x00000000f8b80000 <--
pudp @ 0xc0000000f8b81ff8 = 0x00000000037c4000
pmdp @ 0xc0000000037c5ca0 = 0x00000000fbd89000
ptep @ 0xc0000000fbd89000 = 0xc0800001d5ce0386
Maps physical address = 0x00000001d5ce0000
Flags = Accessed Dirty Read Write

The fix is fairly simple. We need to tell kmemleak to ignore PUD
allocations and never report them as leaks. We can also tell it not to
scan the PGD, because it will never find pointers in there. However it
will still notice if we allocate a PGD and then leak it.

Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 738f9645 16-Apr-2018 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/mm: Use page fragments for allocation page table at PMD level

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 8a6c697b 16-Apr-2018 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/mm: Implement helpers for pagetable fragment support at PMD level

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 0c4d2680 16-Apr-2018 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/book3s64/mm: Simplify the rcu callback for page table free

Instead of encoding shift in the table address, use an enumerated index value.
This allow us to do different things in the callback for pte and pmd.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 1c7ec8a4 16-Apr-2018 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/mm/book3s64/4k: Switch 4k pagesize config to use pagetable fragment

4K config use one full page at level 4 of the pagetable. Add support for single
fragment allocation in pagetable fragment code and and use that for 4K config.
This makes both 4k and 64k use the same code path. Later we will switch pmd to
use the page table fragment code. This is done only for 64bit platforms which
is using page table fragment support.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 872a100a 26-Mar-2018 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/mm/hash: Don't memset pgd table if not needed

We need to zero-out pgd table only if we share the slab cache with
pud/pmd level caches. With the support of 4PB, we don't share the slab
cache anymore. Instead of removing the code completely hide it within
an #ifdef. We don't need to do this with any other page table level,
because they all allocate table of double the size and we take of
initializing the first half corrrectly during page table zap.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Consolidate multiple #if / #ifdef into one]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# fc5c2f4a 13-Feb-2018 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/mm/hash64: Zero PGD pages on allocation

On powerpc we allocate page table pages from slab caches of different
sizes. Currently we have a constructor that zeroes out the objects when
we allocate them for the first time.

We expect the objects to be zeroed out when we free the the object
back to slab cache. This happens in the unmap path. For hugetlb pages
we call huge_pte_get_and_clear() to do that.

With the current configuration of page table size, both PUD and PGD
level tables are allocated from the same slab cache. At the PUD level,
we use the second half of the table to store the slot information. But
we never clear that when unmapping.

When such a freed object is then allocated for a PGD page, the second
half of the page table page will not be zeroed as expected. This
results in a kernel crash.

Fix it by always clearing PGD pages when they're allocated.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Change log wording and formatting, add whitespace]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# fae22116 11-Feb-2018 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/mm: Fix crashes with 16G huge pages

To support memory keys, we moved the hash pte slot information to the
second half of the page table. This was ok with PTE entries at level
4 (PTE page) and level 3 (PMD). We already allocate larger page table
pages at those levels to accomodate extra details. For level 4 we
already have the extra space which was used to track 4k hash page
table entry details and at level 3 the extra space was allocated to
track the THP details.

With hugetlbfs PTE, we used this extra space at the PMD level to store
the slot details. But we also support hugetlbfs PTE at PUD level for
16GB pages and PUD level page didn't allocate extra space. This
resulted in memory corruption.

Fix this by allocating extra space at PUD level when HUGETLB is
enabled.

Fixes: bf9a95f9a648 ("powerpc: Free up four 64K PTE bits in 64K backed HPTE pages")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 5b6c133e 15-Aug-2017 Michael Ellerman <mpe@ellerman.id.au>

powerpc/mm/nohash: Move definition of PGALLOC_GFP to fix build errors

In some obscure Book3E configs (randconfig) we can end up missing a
definition for PGALLOC_GFP in pgtable_64.c.

Fix it by moving the definition to asm/pgalloc.h.

Fixes: de3b87611dd1 ("powerpc/mm/book(e)(3s)/64: Add page table accounting")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# dcda9b04 12-Jul-2017 Michal Hocko <mhocko@suse.com>

mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic

__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
the page allocator. This has been true but only for allocations
requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always
ignored for smaller sizes. This is a bit unfortunate because there is
no way to express the same semantic for those requests and they are
considered too important to fail so they might end up looping in the
page allocator for ever, similarly to GFP_NOFAIL requests.

Now that the whole tree has been cleaned up and accidental or misled
usage of __GFP_REPEAT flag has been removed for !costly requests we can
give the original flag a better name and more importantly a more useful
semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
that the allocator would try really hard but there is no promise of a
success. This will work independent of the order and overrides the
default allocator behavior. Page allocator users have several levels of
guarantee vs. cost options (take GFP_KERNEL as an example)

- GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
attempt to free memory at all. The most light weight mode which even
doesn't kick the background reclaim. Should be used carefully because
it might deplete the memory and the next user might hit the more
aggressive reclaim

- GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
allocation without any attempt to free memory from the current
context but can wake kswapd to reclaim memory if the zone is below
the low watermark. Can be used from either atomic contexts or when
the request is a performance optimization and there is another
fallback for a slow path.

- (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
non sleeping allocation with an expensive fallback so it can access
some portion of memory reserves. Usually used from interrupt/bh
context with an expensive slow path fallback.

- GFP_KERNEL - both background and direct reclaim are allowed and the
_default_ page allocator behavior is used. That means that !costly
allocation requests are basically nofail but there is no guarantee of
that behavior so failures have to be checked properly by callers
(e.g. OOM killer victim is allowed to fail currently).

- GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
and all allocation requests fail early rather than cause disruptive
reclaim (one round of reclaim in this implementation). The OOM killer
is not invoked.

- GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
behavior and all allocation requests try really hard. The request
will fail if the reclaim cannot make any progress. The OOM killer
won't be triggered.

- GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
and all allocation requests will loop endlessly until they succeed.
This might be really dangerous especially for larger orders.

Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
because they already had their semantic. No new users are added.
__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
there is no progress and we have already passed the OOM point.

This means that all the reclaim opportunities have been exhausted except
the most disruptive one (the OOM killer) and a user defined fallback
behavior is more sensible than keep retrying in the page allocator.

[akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
[mhocko@suse.com: semantic fix]
Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
[mhocko@kernel.org: address other thing spotted by Vlastimil]
Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alex Belits <alex.belits@cavium.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: NeilBrown <neilb@suse.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# de3b8761 01-May-2017 Balbir Singh <bsingharora@gmail.com>

powerpc/mm/book(e)(3s)/64: Add page table accounting

Introduce a helper pgtable_gfp_flags() which
just returns the current gfp flags and adds
__GFP_ACCOUNT to account for page table allocation.
The generic helper is added to include/asm/pgalloc.h
and has two variants - WARNING ugly bits ahead

1. If the header is included from a module, no check
for mm == &init_mm is done, since init_mm is not
exported
2. For kernel includes, the check is done and required
see (3e79ec7 arch: x86: charge page tables to kmemcg)

The fundamental assumption is that no module should be
doing pgd/pud/pmd and pte alloc's on behalf of init_mm
directly.

NOTE: This adds an overhead to pmd/pud/pgd allocations
similar to x86. The other alternative was to implement
pmd_alloc_kernel/pud_alloc_kernel and pgd_alloc_kernel
with their offset variants.

For 4k page size, pte_alloc_one no longer calls
pte_alloc_one_kernel.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 2379a23e 24-Jun-2016 Michal Hocko <mhocko@suse.com>

powerpc: get rid of superfluous __GFP_REPEAT

__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.

{pud,pmd}_alloc_one are allocating from {PGT,PUD}_CACHE initialized in
pgtable_cache_init which doesn't have larger than sizeof(void *) << 12
size and that fits into !costly allocation request size.

PGALLOC_GFP is used only in radix__pgd_alloc which uses either order-0
or order-4 requests. The first one doesn't need the flag while the
second does. Drop __GFP_REPEAT from PGALLOC_GFP and add it for the
order-4 one.

This means that this flag has never been actually useful here because it
has always been used only for PAGE_ALLOC_COSTLY requests.

Link: http://lkml.kernel.org/r/1464599699-30131-12-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 32d6bd90 24-Jun-2016 Michal Hocko <mhocko@suse.com>

tree wide: get rid of __GFP_REPEAT for order-0 allocations part I

This is the third version of the patchset previously sent [1]. I have
basically only rebased it on top of 4.7-rc1 tree and dropped "dm: get
rid of superfluous gfp flags" which went through dm tree. I am sending
it now because it is tree wide and chances for conflicts are reduced
considerably when we want to target rc2. I plan to send the next step
and rename the flag and move to a better semantic later during this
release cycle so we will have a new semantic ready for 4.8 merge window
hopefully.

Motivation:

While working on something unrelated I've checked the current usage of
__GFP_REPEAT in the tree. It seems that a majority of the usage is and
always has been bogus because __GFP_REPEAT has always been about costly
high order allocations while we are using it for order-0 or very small
orders very often. It seems that a big pile of them is just a
copy&paste when a code has been adopted from one arch to another.

I think it makes some sense to get rid of them because they are just
making the semantic more unclear. Please note that GFP_REPEAT is
documented as

* __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt

* _might_ fail. This depends upon the particular VM implementation.
while !costly requests have basically nofail semantic. So one could
reasonably expect that order-0 request with __GFP_REPEAT will not loop
for ever. This is not implemented right now though.

I would like to move on with __GFP_REPEAT and define a better semantic
for it.

$ git grep __GFP_REPEAT origin/master | wc -l
111
$ git grep __GFP_REPEAT | wc -l
36

So we are down to the third after this patch series. The remaining
places really seem to be relying on __GFP_REPEAT due to large allocation
requests. This still needs some double checking which I will do later
after all the simple ones are sorted out.

I am touching a lot of arch specific code here and I hope I got it right
but as a matter of fact I even didn't compile test for some archs as I
do not have cross compiler for them. Patches should be quite trivial to
review for stupid compile mistakes though. The tricky parts are usually
hidden by macro definitions and thats where I would appreciate help from
arch maintainers.

[1] http://lkml.kernel.org/r/1461849846-27209-1-git-send-email-mhocko@kernel.org

This patch (of 19):

__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations. Yet we
have the full kernel tree with its usage for apparently order-0
allocations. This is really confusing because __GFP_REPEAT is
explicitly documented to allow allocation failures which is a weaker
semantic than the current order-0 has (basically nofail).

Let's simply drop __GFP_REPEAT from those places. This would allow to
identify place which really need allocator to retry harder and formulate
a more specific semantic for what the flag is supposed to do actually.

Link: http://lkml.kernel.org/r/1464599699-30131-2-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: John Crispin <blogic@openwrt.org>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a145abf1 08-Jun-2016 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/mm/radix: Flush page walk cache when freeing page table

Even though a tlb_flush() does a flush with invalidate all cache,
we can end up doing an RCU page table free before calling tlb_flush().
That means we can have page walk cache entries even after we free the
page table pages. This can result in us doing wrong page table walk.

Avoid this by doing pwc flush on every page table free. We can't batch
the pwc flush, because the rcu call back function where we free the
page table pages doesn't have information of the mmu gather. Thus we
have to do a pwc on every page table page freed.

Note: I also removed the dummy tlb_flush_pgtable call functions for
hash 32.

Fixes: 1a472c9dba6b ("powerpc/mm/radix: Add tlbflush routines")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# a2f41eb9 29-Apr-2016 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/mm: Add radix pgalloc details

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 934828ed 29-Apr-2016 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/mm: Make 4K and 64K use pte_t for pgtable_t

This patch switches 4K Linux page size config to use pte_t * type
instead of struct page * for pgtable_t. This simplifies the code a lot
and helps in consolidating both 64K and 4K page allocator routines. The
changes should not have any impact, because we already store physical
address in the upper level page table tree and that implies we already
do struct page * to physical address conversion.

One change to note here is we move the pgtable_page_dtor() call for
nohash to pte_fragment_free_mm(). The nohash related change is due to
the related changes in pgtable_64.c.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 74701d59 29-Apr-2016 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/mm: Rename function to indicate we are allocating fragments

Only code cleanup. No functionality change.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# bcbe7f77 29-Apr-2016 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/mm: Simplify the code dropping 4-level table #ifdef

Simplify the code by dropping 4-level page table #ifdef. We are always
4-level now.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 75a9b8a6 29-Apr-2016 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/mm: Copy pgalloc (part 2)

This moves the nohash variant of pgalloc headers to nohash/ directory

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 101ad5c6 29-Apr-2016 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/mm: Make a copy of pgalloc.h for 32 and 64 book3s

This patch start to make a book3s variant for pgalloc headers. We have
multiple book3s specific changes such as:
* 4 level page table
* store physical address in higher level table
* use pte_t * for pgtable_t

Having a book3s64 specific variant helps to keep code simpler and remove
lots of #ifdef around code.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>