History log of /linux-master/arch/s390/mm/pgtable.c
Revision Date Author Comments
# 0a845e0f 04-Mar-2024 Peter Xu <peterx@redhat.com>

mm/treewide: replace pud_large() with pud_leaf()

pud_large() is always defined as pud_leaf(). Merge their usages. Chose
pud_leaf() because pud_leaf() is a global API, while pud_large() is not.

Link: https://lkml.kernel.org/r/20240305043750.93762-9-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 2f709f7b 04-Mar-2024 Peter Xu <peterx@redhat.com>

mm/treewide: replace pmd_large() with pmd_leaf()

pmd_large() is always defined as pmd_leaf(). Merge their usages. Chose
pmd_leaf() because pmd_leaf() is a global API, while pmd_large() is not.
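
As a minimal user-space sketch of why the two names could be merged: if pmd_large() is only ever an alias of pmd_leaf(), callers can switch names without any change in behavior. The pmd_t layout, the SKETCH_SEGMENT_ENTRY_LARGE value and sketch_walk_step() are illustrative assumptions, not the real s390 definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel types; the bit value is an assumption. */
typedef struct { unsigned long val; } pmd_t;
#define SKETCH_SEGMENT_ENTRY_LARGE 0x400UL

/* The generic, architecture-independent name. */
static inline bool pmd_leaf(pmd_t pmd)
{
	return (pmd.val & SKETCH_SEGMENT_ENTRY_LARGE) != 0;
}

/* The old arch-local name, defined purely as an alias before the cleanup. */
#define pmd_large(pmd) pmd_leaf(pmd)

/* A caller after the conversion: it uses pmd_leaf() directly. */
static inline int sketch_walk_step(pmd_t pmd)
{
	/* The alias and the generic name must always agree. */
	assert(pmd_large(pmd) == pmd_leaf(pmd));
	return pmd_leaf(pmd) ? 1 : 0;	/* 1: entry maps a huge page, stop here */
}
```

Because the alias expands to the same expression, removing it is a pure rename.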

Link: https://lkml.kernel.org/r/20240305043750.93762-8-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a23f517b 11-Jan-2024 Kefeng Wang <wangkefeng.wang@huawei.com>

mm: convert mm_counter() to take a folio

Now all callers of mm_counter() have a folio, convert mm_counter() to take
a folio. Saves a call to compound_head() hidden inside PageAnon().

Link: https://lkml.kernel.org/r/20240111152429.3374566-10-willy@infradead.org
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0601ac88 11-Jan-2024 Kefeng Wang <wangkefeng.wang@huawei.com>

s390: use pfn_swap_entry_folio() in ptep_zap_swap_entry()

Call pfn_swap_entry_folio() in ptep_zap_swap_entry() as preparation for
converting mm counter functions to take a folio.

Link: https://lkml.kernel.org/r/20240111152429.3374566-5-willy@infradead.org
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a6d27ea0 05-Dec-2023 Claudio Imbrenda <imbrenda@linux.ibm.com>

s390/mm: convert pgste locking functions to C

Convert pgste_get_lock() and pgste_set_unlock() to C.

There are no real reasons to keep them in assembler. Having them in C
makes them more readable and maintainable, and better instructions are
used automatically when available.
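
A rough user-space sketch of what a C version of such a lock-bit pair can look like, using C11 atomics in place of the kernel's own atomic helpers. The SKETCH_PCL_BIT position and the sketch_* names are assumptions made for illustration; the commit text does not spell out the implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed lock-bit position, for illustration only. */
#define SKETCH_PCL_BIT 0x0080000000000000ULL

/* Spin until this caller is the one that flipped the lock bit from 0 to 1. */
static inline uint64_t sketch_pgste_get_lock(_Atomic uint64_t *pgste)
{
	uint64_t old;

	do {
		/* Atomically set the bit and fetch the previous value. */
		old = atomic_fetch_or(pgste, SKETCH_PCL_BIT);
	} while (old & SKETCH_PCL_BIT);	/* already set: another holder exists */

	return old | SKETCH_PCL_BIT;
}

/* Store the (possibly modified) value back with the lock bit cleared. */
static inline void sketch_pgste_set_unlock(_Atomic uint64_t *pgste, uint64_t value)
{
	assert(value & SKETCH_PCL_BIT);	/* caller must hold the lock */
	atomic_store(pgste, value & ~SKETCH_PCL_BIT);
}
```

Written this way, the compiler is free to pick whichever atomic instruction the target supports, which is the point the commit message makes.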

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20231205173252.62305-1-imbrenda@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>


# 27072b8e 09-Nov-2023 Claudio Imbrenda <imbrenda@linux.ibm.com>

KVM: s390/mm: Properly reset no-dat

When the CMMA state needs to be reset, the no-dat bit also needs to be
reset. Failure to do so could cause issues in the guest, since the
guest expects the bit to be cleared after a reset.

Cc: <stable@vger.kernel.org>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Message-ID: <20231109123624.37314-1-imbrenda@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>


# 5c7f3bf0 08-Jun-2023 Hugh Dickins <hughd@google.com>

s390: allow pte_offset_map_lock() to fail

In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Add comment on mm's contract with s390 above __zap_zero_pages(),
and fix old comment there: must be called after THP was disabled.

Link: https://lkml.kernel.org/r/3ff29363-336a-9733-12a1-5c31a45c8aeb@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0807b856 06-Feb-2023 Gerald Schaefer <gerald.schaefer@linux.ibm.com>

s390/mm: add support for RDP (Reset DAT-Protection)

The RDP instruction allows resetting the DAT-protection bit in a PTE, with less
CPU synchronization overhead than the IPTE instruction. In particular, IPTE
can cause machine-wide synchronization overhead, and excessive IPTE usage
can negatively impact machine performance.

RDP can be used instead of IPTE, if the new PTE only differs in SW bits
and _PAGE_PROTECT HW bit, for PTE protection changes from RO to RW.
SW PTE bit changes are allowed, e.g. for dirty and young tracking, but no
other HW-defined part of the PTE may change. This is because the
architecture forbids such changes to an active and valid PTE, which
is why invalidation with IPTE is always used first, before writing a new
entry.
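
The eligibility rule described above can be sketched as a small predicate. This is a user-space illustration only: the bit values, SKETCH_RDP_MASK and sketch_pte_allow_rdp() are assumptions chosen for the example, and PTEs are modeled as plain unsigned long values.

```c
#include <stdbool.h>

/* Illustrative bit layout; these values are assumptions. */
#define SKETCH_PAGE_PROTECT 0x200UL	/* HW DAT-protection bit */
#define SKETCH_PAGE_SW_BITS 0x0ffUL	/* software-defined bits */
/* Everything that must stay identical for RDP to be permitted. */
#define SKETCH_RDP_MASK (~(SKETCH_PAGE_PROTECT | SKETCH_PAGE_SW_BITS))

static inline bool sketch_pte_allow_rdp(unsigned long oldval, unsigned long newval)
{
	/* Only an RO -> RW transition qualifies. */
	if (!(oldval & SKETCH_PAGE_PROTECT) || (newval & SKETCH_PAGE_PROTECT))
		return false;
	/* No other hardware-defined bit may differ. */
	return (oldval & SKETCH_RDP_MASK) == (newval & SKETCH_RDP_MASK);
}
```

Any update that fails this check would still need the invalidate-then-write sequence with IPTE.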

The RDP optimization helps mainly for fault-driven SW dirty-bit tracking.
Writable PTEs are initially always mapped with HW _PAGE_PROTECT bit set,
to allow SW dirty-bit accounting on first write protection fault, where
the DAT-protection would then be reset. The reset is now done with RDP
instead of IPTE, if RDP instruction is available.

RDP cannot always guarantee that the DAT-protection reset is propagated
to all CPUs immediately. This means that spurious TLB protection faults
on other CPUs can now occur. For this, common code provides a
flush_tlb_fix_spurious_fault() handler, which will now be used to do a
CPU-local TLB flush. However, this will clear the whole TLB of a CPU, and
not just the affected entry. For more fine-grained flushing, by simply
doing a (local) RDP again, flush_tlb_fix_spurious_fault() would need to
also provide the PTE pointer.

Note that spurious TLB protection faults cannot really be distinguished
from racing pagetable updates, where another thread already installed the
correct PTE. In such a case, the local TLB flush would be unnecessary
overhead, but overall reduction of CPU synchronization overhead by not
using IPTE is still expected to be beneficial.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>


# 3ae11dbc 30-May-2022 Christian Borntraeger <borntraeger@linux.ibm.com>

s390/mm: use non-quiescing sske for KVM switch to keyed guest

The switch to a keyed guest does not require a classic sske as the other
guest CPUs are not accessing the key before the switch is complete.
By using the NQ SSKE things are faster especially with multiple guests.

Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Suggested-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20220530092706.11637-3-borntraeger@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>


# 4a366f51 21-Feb-2022 Heiko Carstens <hca@linux.ibm.com>

s390/mm,pgtable: don't use pte_val()/pXd_val() as lvalue

Convert pgtable code so pte_val()/pXd_val() aren't used as lvalue
anymore. This allows a later step to convert pte_val()/pXd_val() to
functions, which in turn makes it impossible to use these macros to
modify page table entries like they have been used before.

Therefore a construct like this:

	pte_val(*pte) = __pa(addr) | prot;

which would directly write into a page table, isn't possible anymore
with the last step of this series.
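
A minimal user-space sketch of the idea, assuming a one-member pte_t and hypothetical sketch_* helper names: once the accessor is a function it returns a copy, so the only way to change an entry is through an explicit setter.

```c
typedef struct { unsigned long pte; } pte_t;

/* Build a pte_t from a raw value. */
static inline pte_t sketch_mk_pte(unsigned long val)
{
	pte_t pte = { val };
	return pte;
}

/* As a function, the accessor yields a copy and is therefore not an lvalue. */
static inline unsigned long sketch_pte_val(pte_t pte)
{
	return pte.pte;
}

/* The single place where a page table slot is written. */
static inline void sketch_set_pte(pte_t *ptep, pte_t pte)
{
	*ptep = pte;
}

/*
 * With the function form, the old construct no longer compiles:
 *
 *	sketch_pte_val(*ptep) = val;	(error: assignment to a non-lvalue)
 */
```

Funnelling all writes through one helper also gives a single spot to add store ordering or debugging checks later.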

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>


# b8e3b379 21-Feb-2022 Heiko Carstens <hca@linux.ibm.com>

s390/mm: use set_pXd()/set_pte() helper functions everywhere

Use the new set_pXd()/set_pte() helper functions at all places where
page table entries are modified.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>


# 14ea40e2 09-Sep-2021 David Hildenbrand <david@redhat.com>

s390/mm: optimize reset_guest_reference_bit()

We already optimize get_guest_storage_key() to assume that if we don't have
a PTE table and don't have a huge page mapped that the storage key is 0.

Similarly, optimize reset_guest_reference_bit() to simply do nothing if
there is no PTE table and no huge page mapped.

Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20210909162248.14969-10-david@redhat.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# 7cb70266 09-Sep-2021 David Hildenbrand <david@redhat.com>

s390/mm: optimize set_guest_storage_key()

We already optimize get_guest_storage_key() to assume that if we don't have
a PTE table and don't have a huge page mapped that the storage key is 0.

Similarly, optimize set_guest_storage_key() to simply do nothing in case
the key to set is 0.

Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20210909162248.14969-9-david@redhat.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# 8318c404 09-Sep-2021 David Hildenbrand <david@redhat.com>

s390/mm: no need for pte_alloc_map_lock() if we know the pmd is present

pte_map_lock() is sufficient.

Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20210909162248.14969-8-david@redhat.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# 949f5c12 09-Sep-2021 David Hildenbrand <david@redhat.com>

s390/mm: fix VMA and page table handling code in storage key handling functions

There are multiple things broken about our storage key handling
functions:

1. We should not walk/touch page tables outside of VMA boundaries when
holding only the mmap sem in read mode. Evil user space can modify the
VMA layout just before this function runs and e.g., trigger races with
page table removal code since commit dd2283f2605e ("mm: mmap: zap pages
with read mmap_sem in munmap"). gfn_to_hva() will only translate using
KVM memory regions, but won't validate the VMA.

2. We should not allocate page tables outside of VMA boundaries: if
evil user space decides to map hugetlbfs to these ranges, bad things
will happen because we suddenly have PTE or PMD page tables where we
shouldn't have them.

3. We don't handle large PUDs that might suddenly appear inside our page
table hierarchy.

Don't manually allocate page tables, properly validate that we have VMA and
bail out on pud_large().

All callers of page table handling functions, except
get_guest_storage_key(), call fixup_user_fault() in case they
receive an -EFAULT and retry; this will allocate the necessary page tables
if required.

To keep get_guest_storage_key() working as expected and not requiring
kvm_s390_get_skeys() to call fixup_user_fault() distinguish between
"there is simply no page table or huge page yet and the key is assumed
to be 0" and "this is a fault to be reported".
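
The distinction drawn above can be modeled with a toy lookup. Everything here is a hypothetical illustration: struct sketch_lookup, the sketch_* functions and the choice of -ENOENT versus -EFAULT as return codes are assumptions, not details stated in the commit message.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy description of what a lookup found; purely illustrative. */
struct sketch_lookup {
	bool has_vma;		/* address is covered by a VMA */
	bool large_pud;		/* a large PUD sits in the hierarchy */
	bool has_mapping;	/* a PTE table or huge page is present */
	unsigned char key;	/* key stored for the mapping */
};

/* 0: mapping found; -ENOENT: nothing mapped yet; -EFAULT: report a fault. */
static int sketch_lookup_status(const struct sketch_lookup *l)
{
	if (!l->has_vma || l->large_pud)
		return -EFAULT;
	if (!l->has_mapping)
		return -ENOENT;
	return 0;
}

/* Read path: "nothing there yet" means key 0, not an error. */
static int sketch_get_key(const struct sketch_lookup *l, unsigned char *key)
{
	switch (sketch_lookup_status(l)) {
	case -ENOENT:
		*key = 0;
		return 0;
	case 0:
		*key = l->key;
		return 0;
	default:
		return -EFAULT;
	}
}
```

Callers that write keys would instead treat both non-zero outcomes as a reason to resolve the fault and retry.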

Although commit 637ff9efe5ea ("s390/mm: Add huge pmd storage key handling")
introduced most of the affected code, it was actually already broken
before when using get_locked_pte() without any VMA checks.

Note: Ever since commit 637ff9efe5ea ("s390/mm: Add huge pmd storage key
handling") we can no longer set a guest storage key (for example from
QEMU during VM live migration) without actually resolving a fault.
Although we would have created most page tables, we would choke on the
!pmd_present(), requiring a call to fixup_user_fault(). I would
have thought that this is problematic in combination with postcopy live
migration ... but nobody noticed and this patch doesn't change the
situation. So maybe it's just fine.

Fixes: 9fcf93b5de06 ("KVM: S390: Create helper function get_guest_storage_key")
Fixes: 24d5dd0208ed ("s390/kvm: Provide function for setting the guest storage key")
Fixes: a7e19ab55ffd ("KVM: s390: handle missing storage-key facility")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20210909162248.14969-5-david@redhat.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# fe3d1002 09-Sep-2021 David Hildenbrand <david@redhat.com>

s390/mm: validate VMA in PGSTE manipulation functions

We should not walk/touch page tables outside of VMA boundaries when
holding only the mmap sem in read mode. Evil user space can modify the
VMA layout just before this function runs and e.g., trigger races with
page table removal code since commit dd2283f2605e ("mm: mmap: zap pages
with read mmap_sem in munmap"). gfn_to_hva() will only translate using
KVM memory regions, but won't validate the VMA.

Further, we should not allocate page tables outside of VMA boundaries: if
evil user space decides to map hugetlbfs to these ranges, bad things will
happen because we suddenly have PTE or PMD page tables where we
shouldn't have them.

Similarly, we have to check if we suddenly find a hugetlbfs VMA, before
calling get_locked_pte().

Fixes: 2d42f9477320 ("s390/kvm: Add PGSTE manipulation functions")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20210909162248.14969-4-david@redhat.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# 2e827528 02-Sep-2021 Heiko Carstens <hca@linux.ibm.com>

s390/mm: fix kernel doc comments

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>


# af5cdaf8 30-Jun-2021 Alistair Popple <apopple@nvidia.com>

mm: remove special swap entry functions

Patch series "Add support for SVM atomics in Nouveau", v11.

Introduction
============

Some devices have features such as atomic PTE bits that can be used to
implement atomic access to system memory. To support atomic operations to
a shared virtual memory page such a device needs access to that page which
is exclusive of the CPU. This series introduces a mechanism to
temporarily unmap pages granting exclusive access to a device.

These changes are required to support OpenCL atomic operations in Nouveau
to shared virtual memory (SVM) regions allocated with the
CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the
OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_shared_virtual_memory .

Implementation
==============

Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.

Restoring the entry triggers calls to MMU notifiers which allows a device
driver to revoke the atomic access permission from the GPU prior to the
CPU finalising the entry.

Patches
=======

Patches 1 & 2 refactor existing migration and device private entry
functions.

Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one().

Patch 5 renames some existing code but does not introduce functionality.

Patch 6 is a small clean-up to swap entry handling in copy_pte_range().

Patch 7 contains the bulk of the implementation for device exclusive
memory.

Patch 8 contains some additions to the HMM selftests to ensure everything
works as expected.

Patch 9 is a cleanup for the Nouveau SVM implementation.

Patch 10 contains the implementation of atomic access for the Nouveau
driver.

Testing
=======

This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic.
Without this series the test fails as there is no way of write-protecting
the page mapping which results in the device clobbering CPU writes. For
reference the test is available at
https://ozlabs.org/~apopple/opencl_svm_atomics/

Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.

This patch (of 10):

Remove multiple similar inline functions for dealing with different types
of special swap entries.

Both migration and device private swap entries use the swap offset to
store a pfn. Instead of multiple inline functions to obtain a struct page
for each swap entry type use a common function pfn_swap_entry_to_page().
Also open-code the various entry_to_pfn() functions as this results in
shorter code that is easier to understand.

Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b02002cc 13-Jul-2020 Niklas Schnelle <schnelle@linux.ibm.com>

s390/pci: Implement ioremap_wc/prot() with MIO

With our current support for the new MIO PCI instructions, write
combining/write back MMIO memory can be obtained via the pci_iomap_wc()
and pci_iomap_wc_range() functions.
This is achieved by using the write back address for a specific bar
as provided in clp_store_query_pci_fn()

These functions are however not widely used and instead drivers often
rely on ioremap_wc() and ioremap_prot(), which on other platforms enable
write combining using a PTE flag set through the pgprot value.

While we do not have a write combining flag in the low order flag bits
of the PTE like x86_64 does, with MIO support, there is a write back bit
in the physical address (bit 1 on z15) and thus also the PTE.
Which bit is used to toggle write back and whether it is available at
all, is however not fixed in the architecture. Instead we get this
information from the CLP Store Logical Processor Characteristics for PCI
command. When the write back bit is not provided we fall back to the
existing behavior.
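
A small sketch of how a dynamically reported write-back bit could be folded into a protection value. It assumes the "bit 1" above uses IBM-style numbering (bit 0 is the most significant bit of a 64-bit word); that convention and the sketch_* names are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an IBM-style bit number (0 = most significant bit) into a mask. */
static inline uint64_t sketch_wb_mask(unsigned int ibm_bit)
{
	assert(ibm_bit < 64);
	return UINT64_C(1) << (63 - ibm_bit);
}

/*
 * OR the write-back mask into a protection value. A mask of 0 models
 * "no write-back bit reported" and leaves the value unchanged.
 */
static inline uint64_t sketch_prot_writecombine(uint64_t prot, uint64_t mask)
{
	return prot | mask;
}
```

Keeping the mask in a variable rather than a constant reflects that the bit position is reported at runtime and not fixed by the architecture.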

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>


# ca15ca40 07-Aug-2020 Mike Rapoport <rppt@kernel.org>

mm: remove unneeded includes of <asm/pgalloc.h>

Patch series "mm: cleanup usage of <asm/pgalloc.h>"

Most architectures have very similar versions of pXd_alloc_one() and
pXd_free_one() for intermediate levels of page table. These patches add
generic versions of these functions in <asm-generic/pgalloc.h> and enable
use of the generic functions where appropriate.

In addition, functions declared and defined in <asm/pgalloc.h> headers are
used mostly by core mm and early mm initialization in arch and there is no
actual reason to have the <asm/pgalloc.h> included all over the place.
The first patch in this series removes unneeded includes of
<asm/pgalloc.h>

In the end it didn't work out as neatly as I hoped and moving
pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
unnecessary changes to arches that have custom page table allocations, so
I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
to mm/.

This patch (of 8):

In most cases <asm/pgalloc.h> header is required only for allocations of
page table memory. Most of the .c files that include that header do not
use symbols declared in <asm/pgalloc.h> and do not require that header.

As for the other header files that used to include <asm/pgalloc.h>, it is
possible to move that include into the .c file that actually uses symbols
from <asm/pgalloc.h> and drop the include from the header file.

The process was somewhat automated using

	sed -i -E '/[<"]asm\/pgalloc\.h/d' \
		$(grep -L -w -f /tmp/xx \
			$(git grep -E -l '[<"]asm/pgalloc\.h'))

where /tmp/xx contains all the symbols defined in
arch/*/include/asm/pgalloc.h.

[rppt@linux.ibm.com: fix powerpc warning]

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e31cf2f4 08-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: don't include asm/pgtable.h if linux/mm.h is already included

Patch series "mm: consolidate definitions of page table accessors", v2.

The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once. For
instance, we have 31 definitions of pgd_offset() for 25 supported
architectures.

Most of these definitions are actually identical and typically it boils
down to, e.g.

	static inline unsigned long pmd_index(unsigned long address)
	{
		return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
	}

	static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
	{
		return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
	}

These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

For architectures that really need a custom version there is always the
possibility to override the generic version with the usual ifdefs magic.

These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.

This patch (of 12):

The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g. pte_alloc() and
pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.

The include statements in such cases are removed with a simple loop:

	for f in $(git grep -l "include <linux/mm.h>") ; do
		sed -i -e '/include <asm\/pgtable.h>/ d' $f
	done

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 81a8f2be 07-Apr-2019 Thomas Huth <thuth@redhat.com>

s390/mm: silence compiler warning when compiling without CONFIG_PGSTE

If CONFIG_PGSTE is not set (e.g. when compiling without KVM), GCC complains:

CC arch/s390/mm/pgtable.o
arch/s390/mm/pgtable.c:413:15: warning: ‘pmd_alloc_map’ defined but not
used [-Wunused-function]
static pmd_t *pmd_alloc_map(struct mm_struct *mm, unsigned long addr)
^~~~~~~~~~~~~

Wrap the function with "#ifdef CONFIG_PGSTE" to silence the warning.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 04a86453 05-Mar-2019 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

mm: update ptep_modify_prot_commit to take old pte value as arg

Architectures like ppc64 need to do a conditional TLB flush based on
the old and new value of the pte. Enable that by passing the old pte value
as an argument.

Link: http://lkml.kernel.org/r/20190116085035.29729-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0cbe3e26 05-Mar-2019 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

mm: update ptep_modify_prot_start/commit to take vm_area_struct as arg

Patch series "NestMMU pte upgrade workaround for mprotect", v5.

We can upgrade pte access (R -> RW transition) via mprotect. We need to
make sure we follow the recommended pte update sequence as outlined in
commit bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to
handle nest MMU hang") for such updates. This patch series does that.

This patch (of 5):

Some architectures may want to call flush_tlb_range from these helpers.

Link: http://lkml.kernel.org/r/20190116085035.29729-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 32b77252 19-Dec-2018 Christoph Hellwig <hch@lst.de>

s390: remove the ptep_modify_prot_{start,commit} exports

These two functions are only used by core MM code, so no need to export
them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# a9e00d83 13-Jul-2018 Janosch Frank <frankja@linux.ibm.com>

s390/mm: Add huge page gmap linking support

Let's allow huge pmd linking when enabled through the
KVM_CAP_S390_HPAGE_1M capability. Also we can now restrict gmap
invalidation and notification to the cases where the capability has
been activated and save some cycles when that's not the case.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>


# 637ff9ef 13-Jul-2018 Janosch Frank <frankja@linux.ibm.com>

s390/mm: Add huge pmd storage key handling

Storage keys for guests with huge page mappings have to be managed in
hardware. There are no PGSTEs for PMDs that we could use to retain the
guest's logical view of the key.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>


# 0959e168 17-Jul-2018 Janosch Frank <frankja@linux.ibm.com>

s390/mm: Add huge page dirty sync support

To do dirty logging with huge pages, we protect huge pmds in the
gmap. When they are written to, we unprotect them and mark them dirty.

We introduce the function gmap_test_and_clear_dirty_pmd which handles
dirty sync for huge pages.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>


# 6a376277 13-Jul-2018 Janosch Frank <frankja@linux.ibm.com>

s390/mm: Add gmap pmd invalidation and clearing

If the host invalidates a pmd, we also have to invalidate the
corresponding gmap pmds, as well as flush them from the TLB. This is
necessary, as we don't share the pmd tables between host and guest as
we do with ptes.

The clearing part of these three new functions sets a guest pmd entry
to _SEGMENT_ENTRY_EMPTY, so the guest will fault on it and we will
re-link it.

Flushing the gmap is not necessary in the host's lazy local and csp
cases. Both purge the TLB completely.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>


# 55531b74 15-Feb-2018 Janosch Frank <frankja@linux.vnet.ibm.com>

KVM: s390: Add storage key facility interpretation control

Up to now we always expected to have the storage key facility
available for our (non-VSIE) KVM guests. For huge page support, we
need to be able to disable it, so let's introduce that now.

We add the use_skf variable to manage KVM storage key facility
usage. Also we rename use_skey in the mm context struct to uses_skeys
to make it more clear that it is an indication that the vm actively
uses storage keys.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Farhan Ali <alifm@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# ac41aaee 24-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

s390: mm: add SPDX identifiers to the remaining files

It's good to have SPDX identifiers in all files to make it easier to
audit the kernel tree for correct licenses.

Update the arch/s390/mm/ files with the correct SPDX license
identifier based on the license text in the file itself. The SPDX
identifier is a legally binding shorthand, which can be used instead of
the full boiler plate text.

This work is based on a script and data from Thomas Gleixner, Philippe
Ombredanne, and Kate Stewart.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 1bab1c02 29-Aug-2016 Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>

KVM: s390: expose no-DAT to guest and migration support

The STFLE bit 147 indicates whether the ESSA no-DAT operation code is
valid. The bit is not normally provided to the host; the host is
instead provided with an SCLP bit that indicates whether guests can
support the feature.

This patch:
* enables the STFLE bit in the guest if the corresponding SCLP bit is
present in the host.
* adds support for migrating the no-DAT bit in the PGSTEs
* fixes the software interpretation of the ESSA instruction that is
used when migrating, both for the new operation code and for the old
"set stable", as per specifications.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# cd774b90 26-Jul-2016 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm,kvm: use nodat PGSTE tag to optimize TLB flushing

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 28c807e5 26-Jul-2016 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: add guest ASCE TLB flush optimization

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 118bd31b 26-Jul-2016 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: add no-dat TLB flush optimization

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 97ca7bfc 06-Jul-2017 Christian Borntraeger <borntraeger@de.ibm.com>

s390/mm: set change and reference bit on lazy key enablement

When we enable storage keys for a guest lazily, we reset the ACC and F
values. That is correct assuming that these are 0 on a clear reset and
the guest obviously has not used any key setting instruction.

We also zero out the change and reference bit. This is not correct as
the architecture prefers over-indication instead of under-indication
for the keyless->keyed transition.

This patch fixes the behaviour and always sets guest change and guest
reference for all guest storage keys on the keyless -> keyed switch.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 1aea9b3f 24-Apr-2017 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: implement 5 level pages tables

Add the logic to upgrade the page table for a 64-bit process to
five levels. This increases the TASK_SIZE from 8PB to 16EB-4K.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
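
The size jump quoted in this entry can be checked with a little arithmetic. The following is a minimal sketch, not code from the kernel: it assumes the usual s390 layout of 4K pages, 256-entry page tables and 2048-entry segment and region tables, and the helper names va_bits() and task_size_limit() are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Address bits translated per level (assumed s390 layout, see lead-in). */
enum { PAGE_BITS = 12, PT_BITS = 8, CRST_BITS = 11 };

/* Number of virtual address bits covered by `levels` table levels (2..5). */
static unsigned int va_bits(unsigned int levels)
{
	assert(levels >= 2 && levels <= 5);
	return PAGE_BITS + PT_BITS + (levels - 1) * CRST_BITS;
}

/* Upper limit of the address space for a given number of levels. */
static uint64_t task_size_limit(unsigned int levels)
{
	unsigned int bits = va_bits(levels);

	/* 2^64 does not fit in 64 bits, so the top page is left out. */
	if (bits >= 64)
		return UINT64_MAX - 4096 + 1;	/* 2^64 - 4K */
	return (uint64_t)1 << bits;
}
```

Under these assumptions four levels give 2^53 bytes (8PB) and five levels give 2^64 - 4K, which matches the "8PB to 16EB-4K" figure in the commit text.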


# 2d42f947 20-Apr-2017 Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>

s390/kvm: Add PGSTE manipulation functions

Add PGSTE manipulation functions:
* set_pgste_bits sets specific bits in a PGSTE
* get_pgste returns the whole PGSTE
* pgste_perform_essa manipulates a PGSTE to set specific storage states
* ESSA_[SG]ET_* macros used to indicate the action for manipulate_pgste

Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Reviewed-by: Janosch Frank <frankja@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 2e4d8800 02-Mar-2017 Janosch Frank <frankja@linux.vnet.ibm.com>

KVM: s390: Fix guest migration for huge guests resulting in panic

While we can technically not run huge page guests right now, we can
setup a guest with huge pages. Trying to migrate it will trigger a
VM_BUG_ON and, if the kernel is not configured to panic on a BUG, it
will happily try to work on non-existing page table entries.

With this patch, we always return "dirty" if we encounter a large page
when migrating. This at least fixes the immediate problem until we
have proper handling for both kinds of pages.

Fixes: 15f36eb ("KVM: s390: Add proper dirty bitmap support to S390 kvm.")
Cc: <stable@vger.kernel.org> # 3.16+

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 57d7f939 22-Mar-2016 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390: add no-execute support

Bit 0x100 of a page table, segment table or region table entry
can be used to disallow code execution for the virtual addresses
associated with the entry.

There is one tricky bit, the system call to return from a signal
is part of the signal frame written to the user stack. With a
non-executable stack this would stop working. To avoid breaking
things the protection fault handler checks the opcode that caused
the fault for 0x0a77 (sys_sigreturn) and 0x0aad (sys_rt_sigreturn)
and injects a system call. This is preferable to the alternative
solution with a stub function in the vdso because it works for
vdso=off and statically linked binaries as well.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
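
The two checks described in this entry can be sketched in a few lines. This is an illustration only, not the kernel's fault handler: the macro and function names are hypothetical, and only the constants 0x100, 0x0a77 and 0x0aad come from the commit text.

```c
#include <stdbool.h>
#include <stdint.h>

#define ENTRY_NOEXEC		0x100UL	/* no-execute bit named in the commit text */
#define INSN_SVC_SIGRETURN	0x0a77	/* svc for sys_sigreturn */
#define INSN_SVC_RT_SIGRETURN	0x0aad	/* svc for sys_rt_sigreturn */

/* True if a page, segment or region table entry forbids execution. */
static bool entry_is_noexec(uint64_t entry)
{
	return (entry & ENTRY_NOEXEC) != 0;
}

/*
 * True if the 2-byte instruction that caused the protection fault is one
 * of the two signal-return system calls that would get emulated.
 */
static bool is_sigreturn_svc(uint16_t insn)
{
	return insn == INSN_SVC_SIGRETURN || insn == INSN_SVC_RT_SIGRETURN;
}
```

A fault handler built along these lines would inject the system call when is_sigreturn_svc() matches and report an ordinary protection fault otherwise.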


# 4bead2a4 27-Jan-2017 Janosch Frank <frankja@linux.vnet.ibm.com>

KVM: s390: Fix RRBE return code not being CC

reset_guest_reference_bit needs to return the CC, so we can set it in
the guest PSW when emulating RRBE. Right now it only returns 0.

Let's fix that.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
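
As a rough illustration of what "return the CC" means here, the sketch below derives the condition code from the reference and change bits as they stood before the reset. It assumes the architected RRBE encoding (referenced in bit 1, changed in bit 0); the function names are hypothetical and the real code works on PGSTE and storage key bits rather than booleans.

```c
#include <stdbool.h>

/* CC for RRBE: 0 = neither bit set, 1 = changed, 2 = referenced, 3 = both. */
static int rrbe_cc(bool referenced, bool changed)
{
	return ((int)referenced << 1) | (int)changed;
}

/* Reset the reference bit and hand back the CC for the guest PSW. */
static int reset_reference_bit(bool *referenced, bool changed)
{
	int cc = rrbe_cc(*referenced, changed);

	*referenced = false;
	return cc;
}
```

Returning a constant 0, as the code did before this fix, would tell the guest that the page was never referenced or changed.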


# 0d6da872 23-Jan-2017 Christian Borntraeger <borntraeger@de.ibm.com>

s390/mm: Fix cmma unused transfer from pgste into pte

The last pgtable rework silently disabled the CMMA unused state by
setting a local pte variable (a parameter) instead of propagating it
back into the caller. Fix it.

Fixes: ebde765c0e85 ("s390/mm: uninline ptep_xxx functions from pgtable.h")
Cc: stable@vger.kernel.org # v4.6+
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 47e4d851 13-Jun-2016 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: merge local / non-local IDTE helper

Merge the __p[m|u]dp_idte and __p[m|u]dp_idte_local functions into a
single __p[m|u]dp_idte function with an additional parameter.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 34eeaf37 13-Jun-2016 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: merge local / non-local IPTE helper

Merge the __ptep_ipte and __ptep_ipte_local functions into a single
__ptep_ipte function with an additional parameter. The __pte_ipte_range
function is still extra as the while loop makes it hard to merge.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# d08de8e2 04-Jul-2016 Gerald Schaefer <gerald.schaefer@linux.ibm.com>

s390/mm: add support for 2GB hugepages

This adds support for 2GB hugetlbfs pages on s390.

Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# a9d23e71 07-Mar-2016 David Hildenbrand <dahi@linux.vnet.ibm.com>

s390/mm: shadow pages with real guest requested protection

We really want to avoid manually handling protection for nested
virtualization. By shadowing pages with the protection the guest asked us
for, the SIE can handle most protection-related actions for us (e.g.
special handling for MVPG) and we can directly forward protection
exceptions to the guest.

PTEs will now always be shadowed with the correct _PAGE_PROTECT flag.
Unshadowing will take care of any guest changes to the parent PTE and
any host changes to the host PTE. If the host PTE doesn't have the
fitting access rights or is not available, we have to fix it up.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# 4be130a0 07-Mar-2016 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: add shadow gmap support

For a nested KVM guest the outer KVM host needs to create shadow
page tables for the nested guest. This patch adds the basic support
to the guest address space (gmap) code.

For each guest address space the inner KVM host creates, the first
outer KVM host needs to create shadow page tables. The address space
is identified by the ASCE loaded into the control register 1 at the
time the inner SIE instruction for the second nested KVM guest is
executed. The outer KVM host creates the shadow tables starting with
the table identified by the ASCE on an on-demand basis. The outer KVM
host will get repeated faults for all the shadow tables needed to
run the second KVM guest.

While a shadow page table for the second KVM guest is active the access
to the origin region, segment and page tables needs to be restricted
for the first KVM guest. For region, segment and page tables the first
KVM guest may read the memory, but a write attempt has to lead to an
unshadow. This is done using the page invalid and read-only bits in the
page table of the first KVM guest. If the first guest re-accesses one of
the origin pages of a shadow, it gets a fault and the affected parts of
the shadow page table hierarchy need to be removed again.

PGSTE tables don't have to be shadowed, as all interpretation assist can't
deal with the invalid bits in the shadow pte being set differently than
the original ones provided by the first KVM guest.

Many bug fixes and improvements by David Hildenbrand.

Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# b2d73b2a 08-Mar-2016 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: extended gmap pte notifier

The current gmap pte notifier forces a pte into a read-write state.
If the pte is invalidated the gmap notifier is called to inform KVM
that the mapping will go away.

Extend this approach to allow read-write, read-only and no-access
as possible target states and call the pte notifier for any change
to the pte.

This mechanism is used to temporarily set specific access rights for
a pte without doing the heavy work of a true mprotect call.

Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# 64f31d58 25-May-2016 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: simplify the TLB flushing code

ptep_flush_lazy and pmdp_flush_lazy use mm->context.attach_count to
decide between a lazy TLB flush vs an immediate TLB flush. The field
contains two 16-bit counters, the number of CPUs that have the mm
attached and can create TLB entries for it and the number of CPUs in
the middle of a page table update.

The __tlb_flush_asce, ptep_flush_direct and pmdp_flush_direct functions
use the attach counter and a mask check with mm_cpumask(mm) to decide
between a local flush of the current CPU and a global flush.

For all these functions the decision between lazy vs immediate and
local vs global TLB flush can be based on CPU masks. There are two
masks: the mm->context.cpu_attach_mask with the CPUs that are actively
using the mm, and the mm_cpumask(mm) with the CPUs that have used the
mm since the last full flush. The decision between lazy vs immediate
flush is based on the mm->context.cpu_attach_mask, to decide between
local vs global flush the mm_cpumask(mm) is used.

With this patch all checks will use the CPU masks, the old counter
mm->context.attach_count with its two 16-bit values is turned into a
single counter mm->context.flush_count that keeps track of the number
of CPUs with incomplete page table updates. The sole user of this
counter is finish_arch_post_lock_switch() which waits for the end of
all page table updates.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
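
The two mask-based decisions described in this entry can be sketched as follows. This is a simplified model under stated assumptions: a plain 64-bit integer stands in for a cpumask, the struct and function names are hypothetical, and the check for the hardware's local-flush facility is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: one bit per CPU, up to 64 CPUs. */
typedef uint64_t cpu_mask_t;

struct mm_masks {
	cpu_mask_t cpu_attach_mask;	/* CPUs actively using the mm */
	cpu_mask_t mm_cpumask;		/* CPUs that used the mm since the last full flush */
};

/* Lazy vs immediate: lazy is only safe if no other CPU is attached. */
static bool can_flush_lazy(const struct mm_masks *mm, unsigned int this_cpu)
{
	return mm->cpu_attach_mask == ((cpu_mask_t)1 << this_cpu);
}

/* Local vs global: local suffices if no other CPU can hold TLB entries. */
static bool can_flush_local(const struct mm_masks *mm, unsigned int this_cpu)
{
	return mm->mm_cpumask == ((cpu_mask_t)1 << this_cpu);
}
```

With both decisions expressed as mask comparisons, the remaining counter only has to track CPUs that are in the middle of a page table update.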


# a9809407 06-Jun-2016 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: fix vunmap vs finish_arch_post_lock_switch

The vunmap_pte_range() function calls ptep_get_and_clear() without any
locking. ptep_get_and_clear() uses ptep_xchg_lazy()/ptep_flush_direct()
for the page table update. ptep_flush_direct requires that preemption
is disabled, but without any locking this is not the case. If the kernel
preempts the task while the attach_counter is increased an endless loop
in finish_arch_post_lock_switch() will occur the next time the task is
scheduled.

Add explicit preempt_disable()/preempt_enable() calls to the relevant
functions in arch/s390/mm/pgtable.c.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 1c343f7b 13-Jun-2016 Christian Borntraeger <borntraeger@de.ibm.com>

KVM: s390/mm: Fix CMMA reset during reboot

commit 1e133ab296f ("s390/mm: split arch/s390/mm/pgtable.c") factored
out the page table handling code from __gmap_zap and __s390_reset_cmma
into ptep_zap_unused and added a simple flag that tells which one of the
functions (reset or not) is to be made. This also changed the behaviour,
as it also zaps unused page table entries on reset.
Turns out that this is wrong as s390_reset_cmma uses the page walker,
which DOES NOT take the ptl lock.

The simplest fix is to not do the zapping part on reset (which uses
the walker).

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Fixes: 1e133ab296f ("s390/mm: split arch/s390/mm/pgtable.c")
Cc: stable@vger.kernel.org # 4.6+
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# a7e19ab5 10-May-2016 David Hildenbrand <dahi@linux.vnet.ibm.com>

KVM: s390: handle missing storage-key facility

Without the storage-key facility, SIE won't interpret SSKE, ISKE and
RRBE for us. So let's add proper interception handlers that will be called
if lazy sske cannot be enabled.

Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# 1824c723 10-May-2016 David Hildenbrand <dahi@linux.vnet.ibm.com>

KVM: s390: pfmf: support conditional-sske facility

We already indicate that facility but don't implement it in our pfmf
interception handler. Let's add a new storage key handling function for
conditionally setting the guest storage key.

As we will reuse this function later on, let's directly implement returning
the old key via parameter and indicating if any change happened via rc.

Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# 154c8c19 09-May-2016 David Hildenbrand <dahi@linux.vnet.ibm.com>

s390/mm: return key via pointer in get_guest_storage_key

Let's just split returning the key and reporting errors. This makes calling
code easier and avoids bugs as happened already.

Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# 8d6037a7 09-May-2016 David Hildenbrand <dahi@linux.vnet.ibm.com>

s390/mm: simplify get_guest_storage_key

We can save a few LOC and make that function easier to understand
by rewriting existing code.

Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# d3ed1cee 08-Mar-2016 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: set and get guest storage key mmap locking

Move the mmap semaphore locking out of set_guest_storage_key
and get_guest_storage_key. This makes the two functions more
like the other ptep_xxx operations and avoids repeated
semaphore operations if multiple keys are read or written.

Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# c427c42c 10-May-2016 David Hildenbrand <dahi@linux.vnet.ibm.com>

s390/mm: don't drop errors in get_guest_storage_key

Commit 1e133ab296f3 ("s390/mm: split arch/s390/mm/pgtable.c") changed
the return value of get_guest_storage_key to an unsigned char, resulting
in -EFAULT getting interpreted as a valid storage key.

Cc: stable@vger.kernel.org # 4.6+
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
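
The failure mode described in this entry is easy to reproduce outside the kernel. The sketch below is illustrative only: the function names, the `fault` parameter and the key value 0x30 are made up, and the second function merely shows the general shape of returning the status and the key separately.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Buggy shape: error code and key share one unsigned char return value. */
static unsigned char get_key_buggy(int fault)
{
	if (fault)
		return (unsigned char)-EFAULT;	/* wraps into 0..255, looks like a key */
	return 0x30;				/* hypothetical key value */
}

/* Split shape: status in the return value, key through a pointer. */
static int get_key_split(int fault, unsigned char *key)
{
	assert(key != NULL);
	if (fault)
		return -EFAULT;
	*key = 0x30;
	return 0;
}
```

A caller of the first variant cannot tell the wrapped error value from a real key, while a caller of the second can test for a negative return before touching *key.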


# 1e133ab2 08-Mar-2016 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: split arch/s390/mm/pgtable.c

The pgtable.c file is quite big, before it grows any larger split it
into pgtable.c, pgalloc.c and gmap.c. In addition move the gmap related
header definitions into the new gmap.h header and all of the pgste
helpers from pgtable.h to pgtable.c.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 227be799 08-Mar-2016 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: uninline pmdp_xxx functions from pgtable.h

The pmdp_xxx functions are smaller than their ptep_xxx counterparts
but to keep things symmetrical uninline them as well.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# ebde765c 08-Mar-2016 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: uninline ptep_xxx functions from pgtable.h

The code in the various ptep_xxx functions has grown quite large,
consolidate them to four out-of-line functions:
ptep_xchg_direct to exchange a pte with another with immediate flushing
ptep_xchg_lazy to exchange a pte with another in a batched update
ptep_modify_prot_start to begin a protection flags update
ptep_modify_prot_commit to commit a protection flags update

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 443a8133 24-Feb-2016 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/kvm: simplify set_guest_storage_key

Git commit ab3f285f227fec62868037e9b1b1fd18294a83b8
"KVM: s390/mm: try a cow on read only pages for key ops"
added a fixup_user_fault to set_guest_storage_key to force a copy on
write if the page is mapped read-only. This is supposed to fix the
problem of differing storage keys for shared mappings, e.g. the
empty_zero_page.
But if the storage key is set before the pte is mapped the storage
key update is done on the pgste. A later fault will happily map the
shared page with the key from the pgste.

Eventually git commit 2faee8ff9dc6f4bfe46f6d2d110add858140fb20
"s390/mm: prevent and break zero page mappings in case of storage keys"
fixed this problem for the empty_zero_page. The commit makes sure that
guests enabled for storage keys will not use the empty_zero_page at all.

As the call to fixup_user_fault in set_guest_storage_key depends on the
order of the storage key operation vs. the fault that maps the pte
it does not really fix anything. Just remove it.

Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# a9d7ab97 11-Jan-2016 Dominik Dingel <dingel@linux.vnet.ibm.com>

s390/mm: use TASK_MAX_SIZE where applicable

To improve readability we can use TASK_MAX_SIZE when we just check for the
upper limit. All places explicitly dealing with 3 vs 4 level pgtables
were left unchanged.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-By: Sascha Silbe <silbe@linux.vnet.ibm.com>


# fef8953a 15-Jan-2016 Dominik Dingel <dingel@linux.vnet.ibm.com>

s390/mm: enable fixup_user_fault retrying

By passing a non-null flag we allow fixup_user_fault to retry, which
enables userfaultfd. As during these retries we might drop the mmap_sem
we need to check if that happened and redo the complete chain of
actions.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: "Jason J. Herne" <jjherne@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric B Munson <emunson@akamai.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4a9e1cda 15-Jan-2016 Dominik Dingel <dingel@linux.vnet.ibm.com>

mm: bring in additional flag for fixup_user_fault to signal unlock

During Jason's work with postcopy migration support for s390 a problem
regarding gmap faults was discovered.

The gmap code will call fixup_user_fault which will always end up in
handle_mm_fault. Until now we never cared about retries, but as the
userfaultfd code relies on them, this needs a fix.

This patchset does not take care of the futex code. I will now look
closer at this.

This patch (of 2):

With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
faulting we ever unlocked mmap_sem.

This patch brings in the logic to handle retries as well as it cleans up
the current documentation. fixup_user_fault was not having the same
semantics as filemap_fault. It never indicated if a retry happened and
so a caller wasn't able to handle that case. So we now changed the
behaviour to always retry a locked mmap_sem.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: "Jason J. Herne" <jjherne@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric B Munson <emunson@akamai.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fecffad2 15-Jan-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

s390, thp: remove infrastructure for handling splitting PMDs

With new refcounting we don't need to mark PMDs splitting. Let's drop
code to handle this.

pmdp_splitting_flush() is not needed too: on splitting PMD we will do
pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will do IPI as
needed for fast_gup.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# eca56ff9 14-Jan-2016 Jerome Marchand <jmarchan@redhat.com>

mm, shmem: add internal shmem resident memory accounting

Currently looking at /proc/<pid>/status or statm, there is no way to
distinguish shmem pages from pages mapped to a regular file (shmem pages
are mapped to /dev/zero), even though their implication in actual memory
use is quite different.

The internal accounting currently counts shmem pages together with
regular files. As a preparation to extend the userspace interfaces,
this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
shmem pages separately from MM_FILEPAGES. The next patch will expose it
to userspace - this patch doesn't change the exported values yet, by
adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
used before. The only user-visible change after this patch is the OOM
killer message that separates the reported "shmem-rss" from "file-rss".

[vbabka@suse.cz: forward-porting, tweak changelog]
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a3a92c31 01-Dec-2014 Dominik Dingel <dingel@linux.vnet.ibm.com>

KVM: s390: fix mismatch between user and in-kernel guest limit

While the userspace interface requests the maximum size the gmap code
expects to get a maximum address.

This error resulted in bigger page tables than necessary for some guest
sizes, e.g. a 2GB guest used 3 levels instead of 2.

At the same time we introduce KVM_S390_NO_MEM_LIMIT, which allows in a
bright future that a guest spans the complete 64 bit address space.

We also switch to TASK_MAX_SIZE for the initial memory size, this is a
cosmetic change as the previous size also resulted in a 4 level pagetable
creation.

Reported-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
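
The off-by-one described in this entry can be illustrated with a small helper that picks the table depth from a limit address. This is a sketch, not the gmap code: levels_for_limit() is a hypothetical name, and the thresholds assume the usual s390 coverage of 2^31, 2^42 and 2^53 bytes for two, three and four levels.

```c
#include <stdint.h>

/*
 * Number of table levels needed so that `limit`, the highest valid
 * address, is still covered (thresholds are assumptions, see lead-in).
 */
static unsigned int levels_for_limit(uint64_t limit)
{
	if (limit < ((uint64_t)1 << 31))
		return 2;	/* segment table + page table */
	if (limit < ((uint64_t)1 << 42))
		return 3;	/* region-third table on top */
	if (limit < ((uint64_t)1 << 53))
		return 4;	/* region-second table on top */
	return 5;		/* region-first table on top */
}
```

Passing the size of a 2GB guest (2^31) where the highest address (2^31 - 1) is expected selects three levels instead of two, which is the mismatch the commit fixes.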


# 78fb9076 14-Aug-2015 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: simplify page table alloc/free code

With the removal of the dynamic reallocation of page tables for
KVM (see git commit 0b46e0a3ec0d7a04af6a091354f1b5e1b952d70a)
the page table allocation / freeing code can be simplified.

The page table free code can now use the alloc_pgste bit in the
mm context to decide if a page table is 2K or 4K, there is no mix
of different sized page tables anymore. This eliminates the need
to use "page->_mapcount == 0" to check for 4K page table.

Use the lower two bits in page->_mapcount to indicate which
2K fragments of the 4K page are in use.

As 31-bit support is gone, remove the two defines ALLOC_ORDER
and FRAG_MASK and use the constants directly where appropriate.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
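
The two-bit fragment tracking mentioned in this entry can be modelled as a tiny bitmask. The sketch below is an assumption-laden illustration: a plain unsigned int stands in for the low bits of page->_mapcount, the helper names are hypothetical, and no atomicity or locking is shown.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the two low bits kept in page->_mapcount:
 * bit 0 = lower 2K fragment in use, bit 1 = upper 2K fragment in use.
 */
typedef unsigned int frag_mask_t;

/* Claim a free 2K fragment; returns its index (0 or 1), or -1 if full. */
static int frag_alloc(frag_mask_t *mask)
{
	for (int i = 0; i < 2; i++) {
		if (!(*mask & (1U << i))) {
			*mask |= 1U << i;
			return i;
		}
	}
	return -1;
}

/* Release fragment `i`; returns nonzero once the whole 4K page is free. */
static int frag_free(frag_mask_t *mask, int i)
{
	assert(i == 0 || i == 1);
	assert(*mask & (1U << i));
	*mask &= ~(1U << i);
	return (*mask & 3U) == 0;
}
```

When frag_free() reports that both bits are clear, the backing 4K page can be returned to the page allocator.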


# 41318bfe 17-Jul-2015 Dominik Dingel <dingel@linux.vnet.ibm.com>

revert "s390/mm: make hugepages_supported a boot time decision"

Heiko noticed that the current check for hugepage support on s390 is a
little bit too harsh as systems which do not support it will crash.

The reason is that pageblock_order can now get negative when we set
HPAGE_SHIFT to 0. To avoid all this and to avoid opening another can of
worms with enabling HUGETLB_PAGE_SIZE_VARIABLE I think it would be best
to simply allow architectures to define their own hugepages_supported().

Revert bea41197ead3 ("s390/mm: make hugepages_supported a boot time
decision") in preparation.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ad4f99e8 17-Jul-2015 Dominik Dingel <dingel@linux.vnet.ibm.com>

revert "s390/mm: change HPAGE_SHIFT type to int"

Heiko noticed that the current check for hugepage support on s390 is a
little bit too harsh as systems which do not support it will crash.

The reason is that pageblock_order can now get negative when we set
HPAGE_SHIFT to 0. To avoid all this and to avoid opening another can of
worms with enabling HUGETLB_PAGE_SIZE_VARIABLE I think it would be best
to simply allow architectures to define their own hugepages_supported().

This patch (of 4): revert commit cf54e2fce51c ("s390/mm: change
HPAGE_SHIFT type to int") in preparation.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cf54e2fc 25-Jun-2015 Dominik Dingel <dingel@linux.vnet.ibm.com>

s390/mm: change HPAGE_SHIFT type to int

With making HPAGE_SHIFT an unsigned integer we also accidentally changed
pageblock_order. In order to avoid compiler warnings we make
HPAGE_SHIFT an int again.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bea41197 25-Jun-2015 Dominik Dingel <dingel@linux.vnet.ibm.com>

s390/mm: make hugepages_supported a boot time decision

There is a potential bug with KVM and hugetlbfs if the hardware does not
support hugepages (EDAT1). We fix this by making EDAT1 a hard requirement
for hugepages and therefore removing and simplifying code.

As s390, with the sw-emulated hugepages, was the only user of
arch_prepare/release_hugepage I also removed these calls from common and
other architecture code.

This patch (of 5):

By dropping support for hugepages on machines which do not have the
hardware feature EDAT1, we fix a potential s390 KVM bug.

The bug would happen if a guest is backed by hugetlbfs (not supported
currently), but does not get pagetables with PGSTE. This would lead to
random memory overwrites.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0b46e0a3 15-Apr-2015 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/kvm: remove delayed reallocation of page tables for KVM

Replacing a 2K page table with a 4K page table while a VMA is active
for the affected memory region is fundamentally broken. Rip out the
page table reallocation code and replace it with a simple system
control 'vm.allocate_pgste'. If the system control is set the page
tables for all processes are allocated as full 4K pages, even for
processes that do not need it.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 5a79859a 12-Feb-2015 Heiko Carstens <hca@linux.ibm.com>

s390: remove 31 bit support

Remove the 31 bit support in order to reduce maintenance cost and
effectively remove dead code. For a couple of years now there has been no
distribution left that comes with a 31 bit kernel.

The 31 bit kernel had also been broken for more than a year before
anybody noticed. In addition I added a removal warning to the kernel
shown at ipl for 5 minutes: a960062e5826 ("s390: add 31 bit warning
message") which let everybody know about the plan to remove 31 bit
code. We didn't get any response.

Given that the last 31 bit only machine was introduced in 1999 let's
remove the code.
Anybody with 31 bit user space code can still use the compat mode.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 925dfc02 12-Dec-2014 Heiko Carstens <hca@linux.ibm.com>

s390/pgtable: add unsigned long casts

Get rid of warnings like this one:
warning: constant 0xffe0000000000000 is so big it is unsigned long
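A minimal sketch of the kind of change that silences such a warning, assuming the fix is an explicit unsigned long suffix on the constant. The macro and function names below are hypothetical, not the ones used in the kernel headers.

```c
/* Hypothetical names; only the constant itself comes from the warning. */
#define MASK_PLAIN 0xffe0000000000000   /* type is chosen implicitly */
#define MASK_UL    0xffe0000000000000UL /* type is stated explicitly */

/* Both spellings denote the same value; only the declared type differs. */
static int masks_equal(void)
{
	return MASK_PLAIN == MASK_UL;
}
```

The value does not change; the suffix only makes the unsigned type explicit so the checker has nothing to report.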

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# fbc89c95 07-Jan-2015 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: avoid using pmd_to_page for !USE_SPLIT_PMD_PTLOCKS

pmd_to_page() is only available if USE_SPLIT_PMD_PTLOCKS is defined.
The use of pmd_to_page in the gmap code can cause compile errors if
NR_CPUS is smaller than SPLIT_PTLOCK_CPUS. Do not use pmd_to_page
outside of USE_SPLIT_PMD_PTLOCKS sections.

Reported-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 9fcf93b5 23-Sep-2014 Jason J. Herne <jjherne@linux.vnet.ibm.com>

KVM: S390: Create helper function get_guest_storage_key

Define get_guest_storage_key which can be used to get the value of a guest
storage key. This complements the functionality provided by the helper function
set_guest_storage_key. Both functions are needed for live migration of s390
guests that use storage keys.

Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# a697e051 30-Oct-2014 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: use correct unlock function in gmap_ipte_notify

The page table lock is acquired with a call to get_locked_pte,
replace the plain spin_unlock with the correct unlock function
pte_unmap_unlock.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# edeb69e5 07-Oct-2014 Jason J. Herne <jjherne@us.ibm.com>

KVM: s390: Cleanup usage of current->mm in set_guest_storage_key

In set_guest_storage_key, we really want to reference the mm struct given as
a parameter to the function. So replace the current->mm reference with the
mm struct passed in by the caller.

Signed-off-by: Jason J. Herne <jjherne@us.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# 6972cae5 15-Oct-2014 Dominik Dingel <dingel@linux.vnet.ibm.com>

s390/mm: missing pte for gmap_ipte_notify should trigger a VM_BUG

After fixup_user_fault does not fail we have a writeable pte.
That pte might transform but it should not vanish.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 3ac8e380 22-Oct-2014 Dominik Dingel <dingel@linux.vnet.ibm.com>

s390/mm: disable KSM for storage key enabled pages

When storage keys are enabled unmerge already merged pages and prevent
new pages from being merged.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 2faee8ff 22-Oct-2014 Dominik Dingel <dingel@linux.vnet.ibm.com>

s390/mm: prevent and break zero page mappings in case of storage keys

As soon as storage keys are enabled we need to stop working on zero page
mappings to prevent inconsistencies between storage keys and pgste.

Otherwise the following data corruption could happen:
1) guest enables storage key
2) guest sets storage key for not mapped page X
-> change goes to PGSTE
3) guest reads from page X
-> as X was not dirty before, the page will be zero page backed,
storage key from PGSTE for X will go to storage key for zero page
4) guest sets storage key for not mapped page Y (same logic as above)
5) guest reads from page Y
-> as Y was not dirty before, the page will be zero page backed,
storage key from PGSTE for Y will go to storage key for zero page
overwriting storage key for X
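
The sequence above can be replayed with a toy model. This is only a sketch under simplifying assumptions: all types and function names are hypothetical, one storage key is kept per physical page, and a read of an untouched guest page maps a single shared zero page.

```c
#include <stddef.h>

/* Toy model; these are not kernel structures. */
struct phys_page {
	unsigned char storage_key;	/* one key per physical page */
};

struct guest_page {
	unsigned char pgste_key;	/* key parked in the PGSTE while unmapped */
	struct phys_page *backing;	/* NULL until the first access */
};

static struct phys_page zero_page;	/* shared by all untouched pages */

/* Steps 2 and 4: set a key; it goes to the PGSTE while unmapped. */
static void guest_set_key(struct guest_page *gp, unsigned char key)
{
	if (gp->backing)
		gp->backing->storage_key = key;
	else
		gp->pgste_key = key;
}

/* Steps 3 and 5: a read maps the zero page and moves the key onto it. */
static void guest_read(struct guest_page *gp)
{
	if (!gp->backing) {
		gp->backing = &zero_page;
		gp->backing->storage_key = gp->pgste_key;
	}
}

static unsigned char guest_get_key(const struct guest_page *gp)
{
	return gp->backing ? gp->backing->storage_key : gp->pgste_key;
}
```

In this model, setting and reading X followed by setting and reading Y leaves X reporting the key of Y, because both pages share the one physical zero page.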

While holding the mmap sem, we are safe against changes on entries we
already fixed, as every fault would need to take the mmap_sem (read).

Other vCPUs executing storage key instructions will get a one time interception
and be serialized also with mmap_sem.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# a13cff31 22-Oct-2014 Dominik Dingel <dingel@linux.vnet.ibm.com>

s390/mm: refactor global pgste updates

Replace the s390 specific page table walker for the pgste updates
with a call to the common code walk_page_range function.
There are now two pte modification functions, one for the reset
of the CMMA state and another one for the initialization of the
storage keys.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 66e9bbdb 06-Oct-2014 Dominik Dingel <dingel@linux.vnet.ibm.com>

s390/mm: fixing calls of pte_unmap_unlock

pte_unmap works on page table entry pointers, dereferencing should be avoided.
As on s390 pte_unmap is a NOP, this is more of a cleanup in case we want to
supply such a function later.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# dc77d344 26-Aug-2014 Christian Borntraeger <borntraeger@de.ibm.com>

KVM: s390/mm: fix up indentation of set_guest_storage_key

commit ab3f285f227f ("KVM: s390/mm: try a cow on read only pages for
key ops") misaligned a code block. Let's fix up the indentation.

Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# c6c956b8 01-Jul-2014 Martin Schwidefsky <schwidefsky@de.ibm.com>

KVM: s390/mm: support gmap page tables with less than 5 levels

Add an addressing limit to the gmap address spaces and only allocate
the page table levels that are needed for the given limit. The limit
is fixed and can not be changed after a gmap has been created.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# 527e30b4 30-Apr-2014 Martin Schwidefsky <schwidefsky@de.ibm.com>

KVM: s390/mm: use radix trees for guest to host mappings

Store the target address for the gmap segments in a radix tree
instead of using invalid segment table entries. gmap_translate
becomes a simple radix_tree_lookup, gmap_fault is split into the
address translation with gmap_translate and the part that does
the linking of the gmap shadow page table with the process page
table.
A second radix tree is used to keep the pointers to the segment
table entries for segments that are mapped in the guest address
space. On unmap of a segment the pointer is retrieved from the
radix tree and is used to carry out the segment invalidation in
the gmap shadow page table. As the radix tree can only store one
pointer, each host segment may only be mapped to exactly one
guest location.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# 6e0a0431 29-Apr-2014 Martin Schwidefsky <schwidefsky@de.ibm.com>

KVM: s390/mm: cleanup gmap function arguments, variable names

Make the order of arguments for the gmap calls more consistent,
if the gmap pointer is passed it is always the first argument.
In addition distinguish between guest address and user address
by naming the variables gaddr for a guest address and vmaddr for
a user address.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# 9da4e380 30-Apr-2014 Martin Schwidefsky <schwidefsky@de.ibm.com>

KVM: s390/mm: readd address parameter to gmap_do_ipte_notify

Revert git commit c3a23b9874c1 ("remove unnecessary parameter from
gmap_do_ipte_notify").

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# ab3f285f 19-Aug-2014 Christian Borntraeger <borntraeger@de.ibm.com>

KVM: s390/mm: try a cow on read only pages for key ops

The PFMF instruction handler blindly wrote the storage key even if
the page was mapped R/O in the host. Let's try a COW before continuing
and bail out in case of errors.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org


# 152125b7 24-Jul-2014 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: implement dirty bits for large segment table entries

The large segment table entry format has a block of bits for the
ACC/F values for the large page. These bits are valid only if
another bit (AV bit 0x10000) of the segment table entry is set.
The ACC/F bits do not have a meaning if the AV bit is off.
This allows putting the THP splitting bit, the segment young bit
and the new segment dirty bit into the ACC/F bits as long as
the AV bit stays off. The dirty and young information is only
available if the pmd is large.
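
A rough sketch of the bit reuse described above. Only the AV bit value 0x10000 comes from the text; the positions chosen for the software dirty and young bits, and all names, are assumptions made for illustration.

```c
/* AV bit value as quoted in the text above. */
#define SEG_ENTRY_AV		0x10000UL
/* Hypothetical positions for software bits inside the ACC/F block. */
#define SEG_ENTRY_SW_DIRTY	0x2000UL
#define SEG_ENTRY_SW_YOUNG	0x1000UL

/* Software bits may only live in the ACC/F block while AV is off. */
static int seg_sw_bits_usable(unsigned long ste)
{
	return (ste & SEG_ENTRY_AV) == 0;
}

/* Mark the entry dirty, leaving it untouched if ACC/F is in real use. */
static unsigned long seg_mkdirty(unsigned long ste)
{
	return seg_sw_bits_usable(ste) ? (ste | SEG_ENTRY_SW_DIRTY) : ste;
}

/* The dirty information is only meaningful while AV is off. */
static int seg_dirty(unsigned long ste)
{
	return seg_sw_bits_usable(ste) && (ste & SEG_ENTRY_SW_DIRTY) != 0;
}
```

The point of the design is that no additional bits are needed: as long as the AV bit stays off, the hardware ignores the ACC/F block and software can keep its own state there.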

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 55e4283c 25-Jul-2014 Christian Borntraeger <borntraeger@de.ibm.com>

KVM: s390/mm: Fix page table locking vs. split pmd lock

commit ec66ad66a0de87866be347b5ecc83bd46427f53b (s390/mm: enable
split page table lock for PMD level) activated the split pmd lock
for s390. Turns out that we missed one place: We also have to take
the pmd lock instead of the page table lock when we reallocate the
page tables (==> changing entries in the PMD) during sie enablement.

Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# beef560b 14-Apr-2014 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/uaccess: simplify control register updates

Always switch to the kernel ASCE in switch_mm. Load the secondary
space ASCE in finish_arch_post_lock_switch after checking that
any pending page table operations have completed. The primary
ASCE is loaded in entry[64].S. With this the update_primary_asce
call can be removed from the switch_to macro and from the start
of switch_mm function. Remove the load_primary argument from
update_user_asce/clear_user_asce, rename update_user_asce to
set_user_asce and rename update_primary_asce to load_kernel_asce.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 3a801517 16-May-2014 Martin Schwidefsky <schwidefsky@de.ibm.com>

KVM: s390: correct locking for s390_enable_skey

Use the mm semaphore to serialize multiple invocations of s390_enable_skey.
The second CPU faulting on a storage key operation needs to wait for the
completion of the page table update. Taking the mm semaphore writable
has the positive side-effect that it prevents any host faults from
taking place which does have implications on keys vs PGSTE.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# a0bf4f14 24-Mar-2014 Dominik Dingel <dingel@linux.vnet.ibm.com>

KVM: s390/mm: new gmap_test_and_clear_dirty function

For live migration kvm needs to test and clear the dirty bit of guest pages.

This is what ptep_test_and_clear_user_dirty is for. To be sure we are not
racing with other code, we protect the pte. This needs to be done within
the architecture memory management code.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# 0a61b222 17-Oct-2013 Martin Schwidefsky <schwidefsky@de.ibm.com>

KVM: s390/mm: use software dirty bit detection for user dirty tracking

Switch the user dirty bit detection used for migration from the hardware
provided host change-bit in the pgste to a fault based detection method.
This reduced the dependency of the host from the storage key to a point
where it becomes possible to enable the RCP bypass for KVM guests.

The fault based dirty detection will only indicate changes caused
by accesses via the guest address space. The hardware based method
can detect all changes, even those caused by I/O or accesses via the
kernel page table. The KVM/qemu code needs to take this into account.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# 934bc131 14-Jan-2014 Dominik Dingel <dingel@linux.vnet.ibm.com>

KVM: s390: Allow skeys to be enabled for the current process

Introduce a new function s390_enable_skey(), which enables storage key
handling via setting the use_skey flag in the mmu context.

This function is only useful within the context of kvm.

Note that enabling storage keys will cause a one-time hiccup when
walking the page table; however, it saves us special effort for cases
like clear reset while making it possible for us to be architecture
conformant.

s390_enable_skey() takes the page table lock to prevent resetting
storage keys triggered from multiple vcpus.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# d4cb1134 29-Jan-2014 Dominik Dingel <dingel@linux.vnet.ibm.com>

KVM: s390: Clear storage keys

page_table_reset_pgste() already does a complete page table walk to
reset the pgste. Enhance it to initialize the storage keys to
PAGE_DEFAULT_KEY if requested by the caller. This will be used
for lazy storage key handling. Also provide an empty stub for
!CONFIG_PGSTE.

Let's adapt the current code (diag 308) to not clear the keys.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>


# 1e1836e8 07-Apr-2014 Alex Thorlton <athorlton@sgi.com>

mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags"

The main motivation behind this patch is to provide a way to disable THP
for jobs where the code cannot be modified, and using a malloc hook with
madvise is not an option (i.e. statically allocated data). This patch
allows us to do just that, without affecting other jobs running on the
system.

We need to do this sort of thing for jobs where THP hurts performance,
due to the possibility of increased remote memory accesses that can be
created by situations such as the following:

When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will
be handed out, and the THP will be stuck on whatever node the chunk was
originally referenced from. If many remote nodes need to do work on
that same chunk, they'll be making remote accesses.

With THP disabled, 4K pages can be handed out to separate nodes as
they're needed, greatly reducing the amount of remote accesses to
memory.

This patch is based on some of my work combined with some
suggestions/patches given by Oleg Nesterov. The main goal here is to
add a prctl switch to allow us to disable THP on a per mm_struct
basis.

Here's a bit of test data with the new patch in place...

First with the flag unset:

# perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
Setting thp_disabled for this task...
thp_disable: 0
Set thp_disabled state to 0
Process pid = 18027

PF/
MAX MIN TOTCPU/ TOT_PF/ TOT_PF/ WSEC/
TYPE: CPUS WALL WALL SYS USER TOTCPU CPU WALL_SEC SYS_SEC CPU NODES
512 1.120 0.060 0.000 0.110 0.110 0.000 28571428864 -9223372036854775808 55803572 23

Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

273719072.841402 task-clock # 641.026 CPUs utilized [100.00%]
1,008,986 context-switches # 0.000 M/sec [100.00%]
7,717 CPU-migrations # 0.000 M/sec [100.00%]
1,698,932 page-faults # 0.000 M/sec
355,222,544,890,379 cycles # 1.298 GHz [100.00%]
536,445,412,234,588 stalled-cycles-frontend # 151.02% frontend cycles idle [100.00%]
409,110,531,310,223 stalled-cycles-backend # 115.17% backend cycles idle [100.00%]
148,286,797,266,411 instructions # 0.42 insns per cycle
# 3.62 stalled cycles per insn [100.00%]
27,061,793,159,503 branches # 98.867 M/sec [100.00%]
1,188,655,196 branch-misses # 0.00% of all branches

427.001706337 seconds time elapsed

Now with the flag set:

# perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
Setting thp_disabled for this task...
thp_disable: 1
Set thp_disabled state to 1
Process pid = 144957

PF/
MAX MIN TOTCPU/ TOT_PF/ TOT_PF/ WSEC/
TYPE: CPUS WALL WALL SYS USER TOTCPU CPU WALL_SEC SYS_SEC CPU NODES
512 0.620 0.260 0.250 0.320 0.570 0.001 51612901376 128000000000 100806448 23

Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

138789390.540183 task-clock # 641.959 CPUs utilized [100.00%]
534,205 context-switches # 0.000 M/sec [100.00%]
4,595 CPU-migrations # 0.000 M/sec [100.00%]
63,133,119 page-faults # 0.000 M/sec
147,977,747,269,768 cycles # 1.066 GHz [100.00%]
200,524,196,493,108 stalled-cycles-frontend # 135.51% frontend cycles idle [100.00%]
105,175,163,716,388 stalled-cycles-backend # 71.07% backend cycles idle [100.00%]
180,916,213,503,160 instructions # 1.22 insns per cycle
# 1.11 stalled cycles per insn [100.00%]
26,999,511,005,868 branches # 194.536 M/sec [100.00%]
714,066,351 branch-misses # 0.00% of all branches

216.196778807 seconds time elapsed

As with previous versions of the patch, we're getting about a 2x
performance increase here. Here's a link to the test case I used, along
with the little wrapper to activate the flag:

http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz

This patch (of 3):

Revert commit 8e72033f2a48 and add in code to fix up any issues caused
by the revert.

The revert is necessary because hugepage_madvise would return -EINVAL
when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
patch set.

Here's a snip of an e-mail from Gerald detailing the original purpose of
this code, and providing justification for the revert:

"The intent of commit 8e72033f2a48 was to guard against any future
programming errors that may result in an madvise(MADV_HUGEPAGE) on
guest mappings, which would crash the kernel.

Martin suggested adding the bit to arch/s390/mm/pgtable.c, if
8e72033f2a48 was to be reverted, because that check will also prevent
a kernel crash in the case described above, it will now send a
SIGSEGV instead.

This would now also allow to do the madvise on other parts, if
needed, so it is a more flexible approach. One could also say that
it would have been better to do it this way right from the
beginning..."

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 457f2180 21-Mar-2014 Heiko Carstens <hca@linux.ibm.com>

s390/uaccess: rework uaccess code - fix locking issues

The current uaccess code uses a page table walk in some circumstances,
e.g. in case of the in atomic futex operations or if running on old
hardware which doesn't support the mvcos instruction.

However it turned out that the page table walk code does not correctly
lock page tables when accessing page table entries.
In other words: a different cpu may invalidate a page table entry while
the current cpu inspects the pte. This may lead to random data corruption.

Adding correct locking however isn't trivial for all uaccess operations.
Especially copy_in_user() is problematic since that requires to hold at
least two locks, but must be protected against ABBA deadlock when a
different cpu also performs a copy_in_user() operation.

So the solution is a different approach where we change address spaces:

User space runs in primary address mode, or access register mode within
vdso code, like it currently already does.

The kernel usually also runs in home space mode, however when accessing
user space the kernel switches to primary or secondary address mode if
the mvcos instruction is not available or if a compare-and-swap (futex)
instruction on a user space address is performed.
KVM however is special, since that requires the kernel to run in home
address space while implicitly accessing user space with the sie
instruction.

So we end up with:

User space:
- runs in primary or access register mode
- cr1 contains the user asce
- cr7 contains the user asce
- cr13 contains the kernel asce

Kernel space:
- runs in home space mode
- cr1 contains the user or kernel asce
-> the kernel asce is loaded when a uaccess requires primary or
secondary address mode
- cr7 contains the user or kernel asce, (changed with set_fs())
- cr13 contains the kernel asce

In case of uaccess the kernel changes to:
- primary space mode in case of a uaccess (copy_to_user) and uses
e.g. the mvcp instruction to access user space. However the kernel
will stay in home space mode if the mvcos instruction is available
- secondary space mode in case of futex atomic operations, so that the
instructions come from primary address space and data from secondary
space

In case of kvm the kernel runs in home space mode, but cr1 gets switched
to contain the gmap asce before the sie instruction gets executed. When
the sie instruction is finished cr1 will be switched back to contain the
user asce.

A context switch between two processes will always load the kernel asce
for the next process in cr1. So the first exit to user space is a bit
more expensive (one extra load control register instruction) than before,
however keeps the code rather simple.

In sum this means there is no need to perform any error prone page table
walks anymore when accessing user space.

The patch seems to be rather large, however it mainly removes the
page table walk code and restores the previously deleted "standard"
uaccess code, with a couple of changes.

The uaccess without mvcos mode can be enforced with the "uaccess_primary"
kernel parameter.

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 1b948d6c 03-Apr-2014 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm,tlb: optimize TLB flushing for zEC12

The zEC12 machines introduced the local-clearing control for the IDTE
and IPTE instruction. If the control is set only the TLB of the local
CPU is cleared of entries, either all entries of a single address space
for IDTE, or the entry for a single page-table entry for IPTE.
Without the local-clearing control the TLB flush is broadcast to all
CPUs in the configuration, which is expensive.

The reset of the bit mask of the CPUs that need flushing after a
non-local IDTE is tricky. As TLB entries for an address space remain
in the TLB even if the address space is detached a new bit field is
required to keep track of attached CPUs vs. CPUs in the need of a
flush. After a non-local flush with IDTE the bit-field of attached CPUs
is copied to the bit-field of CPUs in need of a flush. The ordering
of operations on cpu_attach_mask, attach_count and mm_cpumask(mm) is
such that an underindication in mm_cpumask(mm) is prevented but an
overindication in mm_cpumask(mm) is possible.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 02a8f3ab 03-Apr-2014 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm,tlb: safeguard against speculative TLB creation

The Principles of Operation states that the CPU is allowed to create
TLB entries for an address space anytime while an ASCE is loaded to
the control register. This is true even if the CPU is running in the
kernel and the user address space is not (actively) accessed.

In theory this can affect two aspects of the TLB flush logic.
For full-mm flushes the ASCE of the dying process is still attached.
The approach to flush first with IDTE and then just free all page
tables can in theory lead to stale TLB entries. Use the batched
free of page tables for the full-mm flushes as well.

For operations that can have a stale ASCE in the control register,
e.g. a delayed update_user_asce in switch_mm, load the kernel ASCE
to prevent invalid TLBs from being created.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# aaeff84a 19-Mar-2014 Dominik Dingel <dingel@linux.vnet.ibm.com>

s390/mm: remove unnecessary parameter from gmap_do_ipte_notify

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# c7c5be73 19-Mar-2014 Dominik Dingel <dingel@linux.vnet.ibm.com>

s390/mm: fixing comment so that parameter name match

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# ec66ad66 12-Feb-2014 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: enable split page table lock for PMD level

Add the pgtable_pmd_page_ctor/pgtable_pmd_page_dtor calls to the pmd
allocation and free functions and enable ARCH_ENABLE_SPLIT_PMD_PTLOCK
for 64 bit.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# deedabb2 21-May-2013 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/kvm: set guest page states to stable on re-ipl

The guest page state needs to be reset to stable for all pages
on initial program load via diagnose 0x308.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# b31288fa 17-Apr-2013 Konstantin Weitz <konstantin.weitz@gmail.com>

s390/kvm: support collaborative memory management

This patch enables Collaborative Memory Management (CMM) for kvm
on s390. CMM allows the guest to inform the host about page usage
(see arch/s390/mm/cmm.c). The host uses this information to avoid
swapping in unused pages in the page fault handler. Further, a CPU
provided list of unused invalid pages is processed to reclaim swap
space of not yet accessed unused pages.

[ Martin Schwidefsky: patch reordering and cleanup ]

Signed-off-by: Konstantin Weitz <konstantin.weitz@gmail.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# b4a96015 12-Dec-2013 Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

s390: Fix misspellings using 'codespell' tool

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# e89cfa58 14-Nov-2013 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

s390: handle pgtable_page_ctor() fail

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c389a250 14-Nov-2013 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, thp: do not access mm->pmd_huge_pte directly

Currently mm->pmd_huge_pte is protected by the page table lock. It will
not work with split lock. We have to have per-pmd pmd_huge_pte for proper
access serialization.

For now, let's just introduce a wrapper to access mm->pmd_huge_pte.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 10607864 28-Oct-2013 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm,tlb: correct tlb flush on page table upgrade

The IDTE instruction used to flush TLB entries for a specific address
space uses the address-space-control element (ASCE) to identify
affected TLB entries. The upgrade of a page table adds a new top
level page table which changes the ASCE. The TLB entries associated
with the old ASCE need to be flushed and the ASCE for the address space
needs to be replaced synchronously on all CPUs which currently use it.
The concept of a lazy ASCE update with an exception handler is broken.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# be39f196 31-Oct-2013 Dominik Dingel <dingel@linux.vnet.ibm.com>

s390/mm: page_table_realloc returns failure

There is a possible race between setting has_pgste and reallocation of the
page_table, change the order to fix this.
Also page_table_alloc_pgste can fail; in that case we need to backpropagate this
as -ENOMEM to the caller of page_table_realloc.

Based on a patch by Christian Borntraeger <borntraeger@de.ibm.com>.

Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# e258d719 24-Sep-2013 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/uaccess: always run the kernel in home space

Simplify the uaccess code by removing the user_mode=home option.
The kernel will now always run in the home space mode.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 63df41d6 06-Sep-2013 Heiko Carstens <hca@linux.ibm.com>

s390: make various functions static, add declarations to header files

Make various functions static, add declarations to header files to
fix a couple of sparse findings.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# 984e2a59 06-Sep-2013 Heiko Carstens <hca@linux.ibm.com>

s390/mm: add __releases()/__acquires() annotations to gmap_alloc_table()

Let sparse not incorrectly complain about unbalanced locking.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# 0944fe3f 23-Jul-2013 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: implement software referenced bits

The last remaining use for the storage key of the s390 architecture
is reference counting. The alternative is to make page table entries
invalid while they are old. On access the fault handler marks the
pte/pmd as young which makes the pte/pmd valid if the access rights
allow read access. The pte/pmd invalidations required for software
managed reference bits cost a bit of performance, on the other hand
the RRBE/RRBM instructions to read and reset the referenced bits are
quite expensive as well.
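
The idea can be illustrated with a small user-space model. This is only a
sketch under stated assumptions, not the s390 implementation: struct toy_pte
and the toy_* helpers are hypothetical names, and the real pte/pmd encoding
packs these states into hardware and software bits.

```c
#include <stdbool.h>

/* Toy page table entry; all fields are illustrative, not the s390 layout. */
struct toy_pte {
	bool present;	/* a page is mapped */
	bool readable;	/* access rights allow read access */
	bool young;	/* software referenced bit */
	bool invalid;	/* what the hardware would see */
};

/* An entry is only valid while it is present, readable and young. */
static void toy_recompute(struct toy_pte *p)
{
	p->invalid = !(p->present && p->readable && p->young);
}

static void toy_mkold(struct toy_pte *p)
{
	p->young = false;
	toy_recompute(p);
}

static void toy_mkyoung(struct toy_pte *p)
{
	p->young = true;
	toy_recompute(p);
}

/*
 * Model of a read access. Returns true if the access went through without
 * a fault. On a fault the "handler" marks a present entry young, so that
 * the retried access succeeds.
 */
static bool toy_access(struct toy_pte *p)
{
	if (!p->invalid)
		return true;
	if (p->present)
		toy_mkyoung(p);
	return false;
}

/* Read and reset the referenced state, as page aging would do. */
static bool toy_test_and_clear_young(struct toy_pte *p)
{
	bool was_young = p->young;

	toy_mkold(p);
	return was_young;
}
```

In this model the cost named above shows up as the extra fault taken on the
first access after toy_test_and_clear_young().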

Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 5c474a1e 16-Aug-2013 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: introduce ptep_flush_lazy helper

Isolate the logic of IDTE vs. IPTE flushing of ptes in two functions,
ptep_flush_lazy and __tlb_flush_mm_lazy.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# e5098611 23-Jul-2013 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: cleanup page table definitions

Improve the encoding of the different pte types and the naming of the
page, segment table and region table bits. Due to the different pte
encoding the hugetlbfs primitives need to be adapted as well. To improve
compatibility with common code make the huge ptes use the encoding of
normal ptes. The conversion between the pte and pmd encoding for a huge
pte is done with set_huge_pte_at and huge_ptep_get.
Overall the code is now easier to understand.

Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# ee6ee55b 26-Jul-2013 Martin Schwidefsky <schwidefsky@de.ibm.com>

KVM: s390: fix task size check

The gmap_map_segment function uses PGDIR_SIZE in the check for the
maximum address in the task's address space. This incorrectly limits
the amount of memory usable for a kvm guest to 4TB. The correct limit
is (1UL << 53). As the TASK_SIZE has different values (4TB vs 8PB)
dependent on the existence of the fourth page table level, create
a new define 'TASK_MAX_SIZE' for (1UL << 53).
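
The arithmetic behind the two limits can be checked with a short sketch.
The macro and function names below are hypothetical and only mirror the
sizes quoted above; this is not the check used in gmap_map_segment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sizes quoted in the commit message; the names are illustrative only. */
#define LIMIT_4TB	(UINT64_C(1) << 42)	/* 4 * 2^40 bytes */
#define LIMIT_8PB	(UINT64_C(1) << 53)	/* 8 * 2^50 bytes */

/*
 * Hypothetical range check: does [start, start + len) fit below limit?
 * Zero-length ranges and ranges that wrap around are rejected.
 */
static bool range_fits(uint64_t start, uint64_t len, uint64_t limit)
{
	if (len == 0 || start + len < start)
		return false;
	return start + len <= limit;
}
```

A mapping that ends just above 4TB is rejected against LIMIT_4TB but
accepted against LIMIT_8PB, which is the difference the fix is about.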

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 3eabaee9 26-Jul-2013 Martin Schwidefsky <schwidefsky@de.ibm.com>

KVM: s390: allow sie enablement for multi-threaded programs

Improve the code to upgrade the standard 2K page tables to 4K page tables
with PGSTEs to allow the operation to happen when the program is already
multi-threaded.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 24d5dd02 27-May-2013 Christian Borntraeger <borntraeger@de.ibm.com>

s390/kvm: Provide function for setting the guest storage key

From time to time we need to set the guest storage key. Let's
provide a helper function that handles the changes with all the
right locking and checking.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 6b0b50b0 05-Jun-2013 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

mm/THP: add pmd args to pgtable deposit and withdraw APIs

This will be used later by powerpc THP support. In powerpc we want to use
pgtable for storing the hash index values. So instead of adding them to
mm_context list, we would like to store them in the second half of the pmd.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# db70ccdf 12-Jun-2013 Christian Borntraeger <borntraeger@de.ibm.com>

KVM: s390: Provide function for setting the guest storage key

From time to time we need to set the guest storage key. Let's
provide a helper function that handles the changes with all the
right locking and checking.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# e86cbd87 29-May-2013 Christian Borntraeger <borntraeger@de.ibm.com>

s390/pgtable: Fix gmap notifier address

The address of the gmap notifier was broken, resulting in
unhandled validity intercepts in KVM. Fix the rmap->vmaddr
to be on a segment boundary.
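
Rounding an address down to a segment boundary is a plain mask operation.
The sketch below assumes the 1MB segment size mentioned in the gmap
description elsewhere in this log; SEG_SIZE, SEG_MASK and seg_base() are
hypothetical names, not the kernel's.

```c
#include <stdint.h>

/* Assumed segment size of 1MB; names are illustrative only. */
#define SEG_SIZE	(UINT64_C(1) << 20)
#define SEG_MASK	(~(SEG_SIZE - 1))

/* Round an address down to the start of its 1MB segment. */
static uint64_t seg_base(uint64_t addr)
{
	return addr & SEG_MASK;
}
```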

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# f8b5ff2c 17-May-2013 Christian Borntraeger <borntraeger@de.ibm.com>

s390: fix gmap_ipte_notifier vs. software dirty pages

On heavy paging load some guest cpus started to loop in gmap_ipte_notify.
This was visible as stalled cpus inside the guest. The gmap_ipte_notifier
tries to map a user page and then made sure that the pte is valid and
writable. Turns out that with the software change bit tracking the pte
can become read-only (and only software writable) if the page is clean.
Since we loop in this code, the page would stay clean and, therefore,
be never writable again.
Let us just use fixup_user_fault, that guarantees to call handle_mm_fault.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>


# 0d0dafc1 17-May-2013 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/kvm: rename RCP_xxx defines to PGSTE_xxx

The RCP byte is a part of the PGSTE value, the existing RCP_xxx names
are inaccurate. As the defines describe bits and pieces of the PGSTE,
the names should start with PGSTE_. The KVM_UR_BIT and KVM_UC_BIT are
part of the PGSTE as well, give them better names as well.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>


# bb4b42ce 08-May-2013 Christian Borntraeger <borntraeger@de.ibm.com>

s390: fix gmap_ipte_notifier vs. software dirty pages

On heavy paging load some guest cpus started to loop in gmap_ipte_notify.
This was visible as stalled cpus inside the guest. The gmap_ipte_notifier
tries to map a user page and then made sure that the pte is valid and
writable. Turns out that with the software change bit tracking the pte
can become read-only (and only software writable) if the page is clean.
Since we loop in this code, the page would stay clean and, therefore,
be never writable again.
Let us just use fixup_user_fault, that guarantees to call handle_mm_fault.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# d3383632 17-Apr-2013 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: add pte invalidation notifier for kvm

Add a notifier for kvm to get control before a page table entry is
invalidated. The notifier is only called for ptes of an address space
with pgstes that have been explicitly marked to require notification.
Kvm will use this to get control before prefix pages of virtual CPU
are unmapped.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# ab8e5235 16-Apr-2013 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm,gmap: segment mapping race

The gmap_map_segment function creates a special invalid segment table
entry with the address of the requested target location in the process
address space. The first access will create the connection between the
gmap segment table and the target page table of the main process.
If two threads do this concurrently both will walk the page tables and
allocate a gmap_rmap structure for the same segment table entry.
To avoid the race, recheck the segment table entry after taking the page
table lock.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# c5034945 10-Sep-2012 Heiko Carstens <hca@linux.ibm.com>

s390/mm,gmap: implement gmap_translate()

Implement gmap_translate() function which translates a guest absolute address
to a user space process address without establishing the guest page table
entries.

This is useful for kvm guest address translations where no memory access
is expected to happen soon (e.g. tprot exception handler).

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 9e0fdb41 05-Mar-2013 Heiko Carstens <hca@linux.ibm.com>

s390/mm,gmap: implement gmap_translate()

Implement gmap_translate() function which translates a guest absolute address
to a user space process address without establishing the guest page table
entries.

This is useful for kvm guest address translations where no memory access
is expected to happen soon (e.g. tprot exception handler).

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 0a4ccc99 02-Nov-2012 Heiko Carstens <hca@linux.ibm.com>

s390/mm: move kernel_page_present/kernel_map_pages to page_attr.c

Keep related functions together and move to appropriate file.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 1ae1c1d0 08-Oct-2012 Gerald Schaefer <gerald.schaefer@linux.ibm.com>

thp, s390: architecture backend for thp on s390

This implements the architecture backend for transparent hugepages
on s390.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 274023da 08-Oct-2012 Gerald Schaefer <gerald.schaefer@linux.ibm.com>

thp, s390: disable thp for kvm host on s390

This patch is part of the architecture backend for thp on s390. It
disables thp for kvm hosts, because there is no kvm host hugepage support
so far. Existing thp mappings are split by follow_page() with FOLL_SPLIT,
and future thp mappings are prevented by setting VM_NOHUGEPAGE in
mm->def_flags.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9501d09f 08-Oct-2012 Gerald Schaefer <gerald.schaefer@linux.ibm.com>

thp, s390: thp pagetable pre-allocation for s390

This patch is part of the architecture backend for thp on s390. It
provides the pagetable pre-allocation functions
pgtable_trans_huge_deposit() and pgtable_trans_huge_withdraw(). Unlike
other archs, s390 has no struct page * as pgtable_t, but rather a pointer
to the page table. So instead of saving the pagetable pre- allocation
list info inside the struct page, it is being saved within the pagetable
itself.
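
A minimal user-space sketch of keeping the list linkage inside the deposited
table itself. All names here (toy_pgtable, deposit_link, toy_mm, toy_deposit,
toy_withdraw) are hypothetical, and the list discipline is an assumption made
for the example, not necessarily the one used by the s390 functions.

```c
#include <stddef.h>

/* Link stored in the first words of an otherwise unused page table. */
struct deposit_link {
	struct deposit_link *next;
};

/* A deposited table is either an array of entries or a list link. */
union toy_pgtable {
	unsigned long entries[256];
	struct deposit_link link;
};

/* Stand-in for the per-mm anchor of deposited page tables. */
struct toy_mm {
	struct deposit_link *deposited;
};

/* Push a pre-allocated table; no separate bookkeeping object is needed. */
static void toy_deposit(struct toy_mm *mm, union toy_pgtable *pgtable)
{
	pgtable->link.next = mm->deposited;
	mm->deposited = &pgtable->link;
}

/* Pop a table, or return NULL if none is deposited. */
static union toy_pgtable *toy_withdraw(struct toy_mm *mm)
{
	struct deposit_link *lh = mm->deposited;

	if (!lh)
		return NULL;
	mm->deposited = lh->next;
	/* Clear the link; a real table would need its entries reset here. */
	lh->next = NULL;
	return (union toy_pgtable *)lh;
}
```

The point of the design is that the table's own memory is free while it is
deposited, so it can carry the list info that other archs keep in struct page.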

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 75077afb 08-Oct-2012 Gerald Schaefer <gerald.schaefer@linux.ibm.com>

thp, s390: thp splitting backend for s390

This patch is part of the architecture backend for thp on s390. It
provides the functions related to thp splitting, including serialization
against gup. Unlike other archs, pmdp_splitting_flush() cannot use a tlb
flushing operation to serialize against gup on s390, because that wouldn't
be stopped by the disabled IRQs. So instead, smp_call_function() is
called with an empty function, which will have the expected effect.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 41459d36 14-Sep-2012 Heiko Carstens <hca@linux.ibm.com>

s390: add uninitialized_var() to suppress false positive compiler warnings

Get rid of these:

arch/s390/kernel/smp.c:134:19: warning: ‘status’ may be used uninitialized in this function [-Wuninitialized]
arch/s390/mm/pgtable.c:641:10: warning: ‘table’ may be used uninitialized in this function [-Wuninitialized]
arch/s390/mm/pgtable.c:644:12: warning: ‘page’ may be used uninitialized in this function [-Wuninitialized]
drivers/s390/cio/cio.c:1037:14: warning: ‘schid’ may be used uninitialized in this function [-Wuninitialized]

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# d1b0d842 02-Sep-2012 Heiko Carstens <hca@linux.ibm.com>

s390/mm: rename addressing_mode to s390_user_mode

Renaming the globally visible variable "user_mode" to "addressing_mode" in
order to fix a name clash was not a good idea. (Commit 37fe1d73 "s390/mm:
rename user_mode variable to addressing_mode")
Looking at the code after a couple of weeks one thinks: addressing mode of
what?
So rename the variable again. This time to s390_user_mode. Which hopefully
makes more sense.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 37fe1d73 27-Jul-2012 Heiko Carstens <hca@linux.ibm.com>

s390/mm: rename user_mode variable to addressing_mode

Fix name clash with user_mode() define which is also used in common code.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 0f6f281b 26-Jul-2012 Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/mm: downgrade page table after fork of a 31 bit process

The downgrade of the 4 level page table created by init_new_context is
currently done only in start_thread31. If a 31 bit process forks the
new mm uses a 4 level page table, including the task size of 2<<42
that goes along with it. This is incorrect as now a 31 bit process
can map memory beyond 2GB. Define arch_dup_mmap to do the downgrade
after fork.

Cc: stable@vger.kernel.org
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# a53c8fab 20-Jul-2012 Heiko Carstens <hca@linux.ibm.com>

s390/comments: unify copyright messages and remove file names

Remove the file name from the comment at top of many files. In most
cases the file name was wrong anyway, so it's rather pointless.

Also unify the IBM copyright statement. We did have a lot of slightly
different statements and wanted to change them one after another
whenever a file gets touched. However that never happened. Instead
people start to take the old/"wrong" statements to use as a template
for new files.
So unify all of them in one go.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# 2739b6d1 09-May-2012 Christian Borntraeger <borntraeger@de.ibm.com>

s390/kvm: bad rss-counter state

commit c3f0327f8e9d7a503f0d64573c311eddd61f197d
mm: add rss counters consistency check
detected the following problem with kvm on s390:

BUG: Bad rss-counter state mm:00000004f73ef000 idx:0 val:-10
BUG: Bad rss-counter state mm:00000004f73ef000 idx:1 val:-5

We have to make sure that we accumulate all rss values into
the mm before we replace the mm to avoid triggering this (harmless)
bug message.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# cd94154cc 11-Apr-2012 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] fix tlb flushing for page table pages

Git commit 36409f6353fc2d7b6516e631415f938eadd92ffa "use generic RCU
page-table freeing code" introduced a tlb flushing bug. Partially revert
the above git commit and go back to s390 specific page table flush code.

For s390 the TLB can contain three types of entries, "normal" TLB
page-table entries, TLB combined region-and-segment-table (CRST) entries
and real-space entries. Linux does not use real-space entries which
leaves normal TLB entries and CRST entries. The CRST entries are
intermediate steps in the page-table translation called translation paths.
For example a 4K page access in a three-level page table setup will
create two CRST TLB entries and one page-table TLB entry. The advantage
of that approach is that a page access next to the previous one can reuse
the CRST entries and needs just a single read from memory to create the
page-table TLB entry. The disadvantage is that the TLB flushing rules are
more complicated, before any page-table may be freed the TLB needs to be
flushed.

In short: the generic RCU page-table freeing code is incorrect for the
CRST entries, in particular the check for mm_users < 2 is troublesome.

This is applicable to 3.0+ kernels.

Cc: <stable@vger.kernel.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# a0616cde 28-Mar-2012 David Howells <dhowells@redhat.com>

Disintegrate asm/system.h for S390

Disintegrate asm/system.h for S390.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-s390@vger.kernel.org


# 2320c579 17-Feb-2012 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] incorrect PageTables counter for kvm page tables

The page_table_free_pgste function is used for kvm processes to free page
tables that have the pgste extension. It calls pgtable_page_ctor instead of
pgtable_page_dtor which increases NR_PAGETABLE instead of decreasing it.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 14045ebf 27-Dec-2011 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] add support for physical memory > 4TB

The kernel address space of a 64 bit kernel currently uses a three level
page table and the vmemmap array has a fixed address and a fixed maximum
size. A three level page table is good enough for systems with less than
3.8TB of memory, for bigger systems four page table levels need to be
used. Each page table level costs a bit of performance, use 3 levels for
normal systems and 4 levels only for the really big systems.
To avoid bloating sparse.o too much set MAX_PHYSMEM_BITS to 46 for a
maximum of 64TB of memory.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# c86cce2a 27-Dec-2011 Christian Borntraeger <borntraeger@de.ibm.com>

[S390] kvm: fix sleeping function ... at mm/page_alloc.c:2260

commit cc772456ac9b460693492b3a3d89e8c81eda5874
[S390] fix list corruption in gmap reverse mapping

added a potential deadlock:

BUG: sleeping function called from invalid context at mm/page_alloc.c:2260
in_atomic(): 1, irqs_disabled(): 0, pid: 1108, name: qemu-system-s39
3 locks held by qemu-system-s39/1108:
#0: (&kvm->slots_lock){+.+.+.}, at: [<000003e004866542>] kvm_set_memory_region+0x3a/0x6c [kvm]
#1: (&mm->mmap_sem){++++++}, at: [<0000000000123790>] gmap_map_segment+0x9c/0x298
#2: (&(&mm->page_table_lock)->rlock){+.+.+.}, at: [<00000000001237a8>] gmap_map_segment+0xb4/0x298
CPU: 0 Not tainted 3.1.3 #45
Process qemu-system-s39 (pid: 1108, task: 00000004f8b3cb30, ksp: 00000004fd5978d0)
00000004fd5979a0 00000004fd597920 0000000000000002 0000000000000000
00000004fd5979c0 00000004fd597938 00000004fd597938 0000000000617e96
0000000000000000 00000004f8b3cf58 0000000000000000 0000000000000000
000000000000000d 000000000000000c 00000004fd597988 0000000000000000
0000000000000000 0000000000100a18 00000004fd597920 00000004fd597960
Call Trace:
([<0000000000100926>] show_trace+0xee/0x144)
[<0000000000131f3a>] __might_sleep+0x12a/0x158
[<0000000000217fb4>] __alloc_pages_nodemask+0x224/0xadc
[<0000000000123086>] gmap_alloc_table+0x46/0x114
[<000000000012395c>] gmap_map_segment+0x268/0x298
[<000003e00486b014>] kvm_arch_commit_memory_region+0x44/0x6c [kvm]
[<000003e004866414>] __kvm_set_memory_region+0x3b0/0x4a4 [kvm]
[<000003e004866554>] kvm_set_memory_region+0x4c/0x6c [kvm]
[<000003e004867c7a>] kvm_vm_ioctl+0x14a/0x314 [kvm]
[<0000000000292100>] do_vfs_ioctl+0x94/0x588
[<0000000000292688>] SyS_ioctl+0x94/0xac
[<000000000061e124>] sysc_noemu+0x22/0x28
[<000003fffcd5e7ca>] 0x3fffcd5e7ca
3 locks held by qemu-system-s39/1108:
#0: (&kvm->slots_lock){+.+.+.}, at: [<000003e004866542>] kvm_set_memory_region+0x3a/0x6c [kvm]
#1: (&mm->mmap_sem){++++++}, at: [<0000000000123790>] gmap_map_segment+0x9c/0x298
#2: (&(&mm->page_table_lock)->rlock){+.+.+.}, at: [<00000000001237a8>] gmap_map_segment+0xb4/0x298

Fix this by freeing the lock on the alloc path. This is ok, since the
gmap table is never freed until we call gmap_free, so the table we are
walking cannot go.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 388186bc 30-Oct-2011 Christian Borntraeger <borntraeger@de.ibm.com>

[S390] kvm: Handle diagnose 0x10 (release pages)

Linux on System z uses a ballooner based on diagnose 0x10 (also known as
collaborative memory management). This patch implements diagnose
0x10 on the guest address space.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 499069e1 30-Oct-2011 Carsten Otte <cotte@de.ibm.com>

[S390] take mmap_sem when walking guest page table

gmap_fault needs to walk the guest page table. However, parts of
that may change if some other thread does munmap. In that case
gmap_unmap_notifier will also unmap the corresponding parts from
the guest page table. We need to take mmap_sem in order to serialize
these operations.
do_exception now calls __gmap_fault with mmap_sem held which does
not get exported to modules. The exported function, which is called
from KVM, now takes mmap_sem.

Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# cc772456 30-Oct-2011 Carsten Otte <cotte@de.ibm.com>

[S390] fix list corruption in gmap reverse mapping

This introduces locking via mm->page_table_lock to protect
the rmap list for guest mappings from being corrupted by concurrent
operations.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# a9162f23 30-Oct-2011 Carsten Otte <cotte@de.ibm.com>

[S390] fix possible deadlock in gmap_map_segment

Fix possible deadlock reported by lockdep:
qemu-system-s39/2963 is trying to acquire lock:
(&mm->mmap_sem){++++++}, at: gmap_alloc_table+0x9c/0x120
but task is already holding lock:
(&mm->mmap_sem){++++++}, at: gmap_map_segment+0xa6/0x27c

Actually gmap_alloc_table is only called in gmap_map_segment, with
mmap_sem held, thus it's safe to simply remove the inner lock.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# e73b7fff 30-Oct-2011 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] memory leak with RCU_TABLE_FREE

The rcu page table free code uses a couple of bits in the page table
pointer passed to tlb_remove_table to discern the different page table
types. __tlb_remove_table extracts the type with an incorrect mask which
leads to memory leaks. The correct mask is ((FRAG_MASK << 4) | FRAG_MASK).
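
The technique of carrying a type in the low bits of an aligned table pointer
can be sketched as follows. FRAG_MASK is given an assumed value of 0x03 for
the example only, and tag_table(), table_type() and table_addr() are
hypothetical helpers, not the functions in pgtable.c.

```c
#include <stdint.h>

#define FRAG_MASK	0x03UL	/* assumed value for this sketch */
#define TYPE_MASK	((FRAG_MASK << 4) | FRAG_MASK)

/* Fold the type bits into the low bits of a sufficiently aligned pointer. */
static void *tag_table(void *table, unsigned long type)
{
	return (void *)((uintptr_t)table | (type & TYPE_MASK));
}

/* Extract the type; the mask has to cover both bit groups. */
static unsigned long table_type(void *tagged)
{
	return (uintptr_t)tagged & TYPE_MASK;
}

/* Recover the original table address. */
static void *table_addr(void *tagged)
{
	return (void *)((uintptr_t)tagged & ~(uintptr_t)TYPE_MASK);
}
```

With a type of 0x21, a mask that covers only FRAG_MASK would yield 0x01 and
lose the upper group of bits, so the free path could take the wrong branch.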

Cc: stable@kernel.org
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 05873df9 26-Sep-2011 Carsten Otte <cotte@de.ibm.com>

[S390] gmap: always up mmap_sem properly

If gmap_unmap_segment figures that the segment was not mapped in the
first place, it needs to up mmap_sem on exit.

Cc: <stable@kernel.org>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 480e5926 20-Sep-2011 Christian Borntraeger <borntraeger@de.ibm.com>

[S390] kvm: fix address mode switching

598841ca9919d008b520114d8a4378c4ce4e40a1 ([S390] use gmap address
spaces for kvm guest images) changed kvm to use a separate address
space for kvm guests. This address space was switched in __vcpu_run.
In some cases (preemption, page fault) there is the possibility that
this address space switch is lost.
The typical symptom was a huge amount of validity intercepts or
random guest addressing exceptions.
Fix this by doing the switch in sie_loop and sie_exit and saving the
address space in the gmap structure itself. Also use the preempt
notifier.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# 944291de 03-Aug-2011 Jan Glauber <jan.glauber@gmail.com>

[S390] missing return in page_table_alloc_pgste

Fix the following compile warning for !CONFIG_PGSTE:

CC arch/s390/mm/pgtable.o
arch/s390/mm/pgtable.c: In function ‘page_table_alloc_pgste’:
arch/s390/mm/pgtable.c:531:1: warning: no return statement in function returning non-void [-Wreturn-type]

Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# e5992f2e 24-Jul-2011 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] kvm guest address space mapping

Add code that allows KVM to control the virtual memory layout that
is seen by a guest. The guest address space uses a second page table
that shares the last level pte-tables with the process page table.
If a page is unmapped from the process page table it is automatically
unmapped from the guest page table as well.

The guest address space mapping starts out empty, KVM can map any
individual 1MB segments from the process virtual memory to any 1MB
aligned location in the guest virtual memory. If a target segment in
the process virtual memory does not exist or is unmapped while a
guest mapping exists the desired target address is stored as an
invalid segment table entry in the guest page table.
The population of the guest page table is fault driven.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 36409f63 06-Jun-2011 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] use generic RCU page-table freeing code

Replace the s390 specific rcu page-table freeing code with the
generic variant. This requires duplicating the definition for the
struct mmu_table_batch as s390 does not use the generic tlb flush
code.

While we are at it remove the restriction that page table fragments
can not be reused after a single fragment has been freed with rcu
and split out allocation and freeing of page tables with pgstes.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 3c5cffb6 28-May-2011 Heiko Carstens <hca@linux.ibm.com>

[S390] mm: fix mmu_gather rework

Quite a few functions that get called from the tlb gather code require that
preemption must be disabled. So disable preemption inside of the called
functions instead.
The only drawback is that rcu_table_freelist_finish() doesn't necessarily get
called on the cpu(s) that filled the free lists. So we may see a delay, until
we finally see an rcu callback. However over time this shouldn't matter.

So we get rid of lots of "BUG: using smp_processor_id() in preemptible"
messages.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# 1c395176 24-May-2011 Peter Zijlstra <a.p.zijlstra@chello.nl>

mm: now that all old mmu_gather code is gone, remove the storage

Fold all the mmu_gather rework patches into one for submission

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 043d0708 23-May-2011 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] Remove data execution protection

The noexec support on s390 does not rely on a bit in the page table
entry but utilizes the secondary space mode to distinguish between
memory accesses for instructions vs. data. The noexec code relies
on the assumption that the cpu will always use the secondary space
page table for data accesses while it is running in the secondary
space mode. Up to the z9-109 class machines this has been the case.
Unfortunately this is not true anymore with z10 and later machines.
The load-relative-long instructions lrl, lgrl and lgfrl access the
memory operand using the same addressing-space mode that has been
used to fetch the instruction.
This breaks the noexec mode for all user space binaries compiled
with march=z10 or later. The only option is to remove the current
noexec support.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# f1be77bb 31-Jan-2011 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] pgtable_list corruption

After page_table_free_rcu removed a page from the pgtable_list
page_table_free better not add it again. Otherwise a page_table_alloc
can reuse a page table fragment that is still in the rcu process.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# e05ef9bd 25-Oct-2010 Christian Borntraeger <borntraeger@de.ibm.com>

[S390] kvm: Fix badness at include/asm/mmu_context.h:83

commit 050eef364ad700590a605a0749f825cab4834b1e
[S390] fix tlb flushing vs. concurrent /proc accesses
broke KVM on s390x. On every schedule a
Badness at include/asm/mmu_context.h:83 appears. s390_enable_sie
replaces the mm on the __running__ task, therefore, we have to
increase the attach count of the new mm.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 80217147 25-Oct-2010 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] lockless get_user_pages_fast()

Implement get_user_pages_fast without locking in the fastpath on s390.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# b11b5334 06-Dec-2009 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] Improve address space mode selection.

Introduce user_mode to replace the two variables switch_amode and
s390_noexec. There are three valid combinations of the old values:
1) switch_amode == 0 && s390_noexec == 0
2) switch_amode == 1 && s390_noexec == 0
3) switch_amode == 1 && s390_noexec == 1
They get replaced by
1) user_mode == HOME_SPACE_MODE
2) user_mode == PRIMARY_SPACE_MODE
3) user_mode == SECONDARY_SPACE_MODE
The new kernel parameter user_mode=[primary,secondary,home] lets
you choose the address space mode the user space processes should
use. In addition the CONFIG_S390_SWITCH_AMODE config option
is removed.
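
The mapping from the two old variables to the single user_mode value can
be sketched as a small C helper. This is only an illustration of the table
above: the helper name legacy_to_user_mode and the enum values are
assumptions, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Mode names are taken from the commit message; the numeric values
 * chosen by the compiler here are illustrative only. */
enum user_mode {
	HOME_SPACE_MODE,
	PRIMARY_SPACE_MODE,
	SECONDARY_SPACE_MODE,
};

/* Hypothetical helper: translate the two old flags into one mode. */
static enum user_mode legacy_to_user_mode(bool switch_amode, bool s390_noexec)
{
	/* s390_noexec without switch_amode is not one of the three
	 * valid combinations listed above. */
	assert(switch_amode || !s390_noexec);

	if (!switch_amode)
		return HOME_SPACE_MODE;
	return s390_noexec ? SECONDARY_SPACE_MODE : PRIMARY_SPACE_MODE;
}
```

Collapsing the flags into one enum makes the invalid fourth combination
unrepresentable once callers only see user_mode.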

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 52a21f2c 06-Oct-2009 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] fix build breakage with CONFIG_AIO=n

next-20090925 randconfig build breaks on s390x, with CONFIG_AIO=n.

arch/s390/mm/pgtable.c: In function 's390_enable_sie':
arch/s390/mm/pgtable.c:282: error: 'struct mm_struct' has no member named 'ioctx_list'
arch/s390/mm/pgtable.c:298: error: 'struct mm_struct' has no member named 'ioctx_list'
make[1]: *** [arch/s390/mm/pgtable.o] Error 1

Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 87458ff4 22-Sep-2009 Heiko Carstens <hca@linux.ibm.com>

[S390] Change kernel_page_present coding style.

Make the inline assembly look like all others.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 50aa98ba 11-Sep-2009 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] fix recursive locking on page_table_lock

Suzuki Poulose reported the following recursive locking bug on s390:

Here is the stack trace : (see Appendix I for more info)

[<0000000000406ed6>] _spin_lock+0x52/0x94
[<0000000000103bde>] crst_table_free+0x14e/0x1a4
[<00000000001ba684>] __pmd_alloc+0x114/0x1ec
[<00000000001be8d0>] handle_mm_fault+0x2cc/0xb80
[<0000000000407d62>] do_dat_exception+0x2b6/0x3a0
[<0000000000114f8c>] sysc_return+0x0/0x8
[<00000200001642b2>] 0x200001642b2

The page_table_lock is already acquired in __pmd_alloc (mm/memory.c) and
it tries to populate the pud/pgd with a new pmd allocated. If another
thread populates it before we get a chance, we free the pmd using
pmd_free().

On s390x, pmd_free() (and even pud_free()) is #defined to crst_table_free(),
which acquires the page_table_lock to protect the crst_table index updates.

Hence this ends up in a recursive locking of the page_table_lock.

The solution suggested by Dave Hansen is to use a new spin lock in the mmu
context to protect the access to the crst_list and the pgtable_list.

Reported-by: Suzuki Poulose <suzuki@in.ibm.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 7db11a36 16-Jun-2009 Hans-Joachim Picht <hans@linux.vnet.ibm.com>

[S390] pm: add kernel_page_present

Fix the following build failure caused by make allyesconfig using
CONFIG_HIBERNATION and CONFIG_DEBUG_PAGEALLOC

kernel/built-in.o: In function `saveable_page':
kernel/power/snapshot.c:897: undefined reference to `kernel_page_present'
kernel/built-in.o: In function `safe_copy_page':
kernel/power/snapshot.c:948: undefined reference to `kernel_page_present'
make: *** [.tmp_vmlinux1] Error 1

Signed-off-by: Hans-Joachim Picht <hans@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 239a6425 12-Jun-2009 Heiko Carstens <hca@linux.ibm.com>

[S390] vmalloc: add vmalloc kernel parameter support

With the kernel parameter 'vmalloc=<size>' the size of the vmalloc area
can be specified. This can be used to increase or decrease the size of
the area. Works in the same way as on some other architectures.
This can be useful for features which make excessive use of vmalloc and
wouldn't work otherwise.
The default sizes remain unchanged: 96MB for 31 bit kernels and 1GB for
64 bit kernels.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 005f8eee 26-Mar-2009 Rusty Russell <rusty@rustcorp.com.au>

[S390] cpumask: use mm_cpumask() wrapper

Makes code futureproof against the impending change to mm->cpu_vm_mask.

It's also a chance to use the new cpumask_ ops which take a pointer
(the older ones are deprecated, but there's no hurry for arch code).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 702d9e58 26-Mar-2009 Carsten Otte <cotte@de.ibm.com>

[S390] check addressing mode in s390_enable_sie

The sie instruction requires address spaces to be switched
to run properly. This patch verifies that this is the case
in s390_enable_sie, otherwise the kernel would crash badly
as soon as the process runs into sie.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# f481bfaf 18-Mar-2009 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] make page table walking more robust

Make page table walking on s390 more robust. The current code requires
that the pgd/pud/pmd/pte loop is only done for address ranges that are
below the end address of the last vma of the address space. But this
is not always true, e.g. the generic page table walker does not guarantee
this. Change TASK_SIZE/TASK_SIZE_OF to reflect the current size of the
address space. This makes the generic page table walker happy but it
breaks the upgrade of a 3 level page table to a 4 level page table.
To make the upgrade work again another fix is required.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# abf137dd 09-Dec-2008 Jens Axboe <jens.axboe@oracle.com>

aio: make the lookup_ioctx() lockless

The mm->ioctx_list is currently protected by a reader-writer lock,
so we always grab that lock on the read side for doing ioctx
lookups. As the workload is extremely reader biased, turn this into
an rcu hlist so we can make lookup_ioctx() lockless. Get rid of
the rwlock and use a spinlock for providing update side exclusion.

There's usually only 1 entry on this list, so it doesn't make sense
to look into fancier data structures.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 250cf776 28-Oct-2008 Christian Borntraeger <borntraeger@de.ibm.com>

[S390] pgtables: Fix race in enable_sie vs. page table ops

The current enable_sie code sets the mm->context.pgstes bit to tell
dup_mm that the new mm should have extended page tables. This bit is also
used by the s390 specific page table primitives to decide about the page
table layout - which means context.pgstes has two meanings. This can cause
all kinds of bugs. For example, shrink_zone can call
ptep_clear_flush_young while enable_sie is running. ptep_clear_flush_young
will test for context.pgstes. Since enable_sie changed that value of the old
struct mm without changing the page table layout ptep_clear_flush_young will
do the wrong thing.
The solution is to split pgstes into two bits
- one for the allocation
- one for the current state
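
The split into an allocation bit and a current-state bit can be sketched
as follows. This is a simplified model, not the kernel code: the struct,
its field names and the three helpers are hypothetical and only show
which side reads which bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of the s390 mm context. */
struct mm_context_sketch {
	bool alloc_pgste; /* allocation: an mm dup'ed from this one gets pgstes */
	bool has_pgste;   /* current state: this mm's page tables carry pgstes */
};

/* enable_sie side: only request pgstes for the next mm. The layout of
 * the old mm, and therefore has_pgste, stays untouched. */
static void request_pgstes(struct mm_context_sketch *old)
{
	assert(old);
	old->alloc_pgste = true;
}

/* dup_mm side: the new mm is built with the requested layout. */
static struct mm_context_sketch dup_context(const struct mm_context_sketch *old)
{
	struct mm_context_sketch new_ctx = {
		.alloc_pgste = old->alloc_pgste,
		.has_pgste = old->alloc_pgste,
	};
	return new_ctx;
}

/* Page table primitives consult only the current-state bit. */
static bool primitives_use_pgstes(const struct mm_context_sketch *ctx)
{
	return ctx->has_pgste;
}
```

In this model a concurrent primitive running on the old mm keeps seeing
has_pgste == false while enable_sie is in progress, which is the property
the fix is after.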

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 74b6b522 21-May-2008 Christian Borntraeger <borntraeger@de.ibm.com>

KVM: s390: fix locking order problem in enable_sie

There is a potential locking problem in enable_sie. We take the task_lock
and the mmap_sem. As exit_mm uses the same locks vice versa, this triggers
a lockdep warning.
The second problem is that dup_mm and mmput might sleep, so we must not
hold the task_lock at that moment.

The solution is to dup the mm unconditionally and use the task_lock before and
afterwards to check if we can use the new mm. dup_mm and mmput are called
outside the task_lock, but we run update_mm while holding the task_lock,
protecting us against ptrace.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>


# 402b0862 25-Mar-2008 Carsten Otte <cotte@de.ibm.com>

s390: KVM preparation: provide hook to enable pgstes in user pagetable

The SIE instruction on s390 uses the 2nd half of the page table page to
virtualize the storage keys of a guest. This patch offers the s390_enable_sie
function, which reorganizes the page tables of a single-threaded process to
reserve space in the page table:
s390_enable_sie makes sure that the process is single threaded and then uses
dup_mm to create a new mm with reorganized page tables. The old mm is freed
and the process now has a page status extended field after every page table.

Code that wants to exploit pgstes should SELECT CONFIG_PGSTE.

This patch has a small common code hit, namely making dup_mm non-static.

Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's
review feedback. Now we do have the prototype for dup_mm in
include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() does now
call task_lock() to prevent race against ptrace modification of mm_users.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>


# 6252d702 09-Feb-2008 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] dynamic page tables.

Add support for a different number of page table levels depending
on the highest address used by a process. This will cause a 31 bit
process to use a two level page table instead of the four level page
table that is the default after the pud has been introduced. Likewise
a normal 64 bit process will use three levels instead of four. Only
if a process runs out of the 4 terabytes which can be addressed with
a three level page table is the fourth level dynamically added. Then
the process can use up to 8 petabytes.
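
The level selection described above can be sketched as a small C function.
This is an illustration only: the function name, the macro names and the
is_31bit parameter are assumptions. The limits are the 2 GB reach of a
31 bit address plus the 4 TB and 8 PB figures from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LIMIT_2_LEVELS (UINT64_C(1) << 31) /* 2 GB: 31 bit address space */
#define LIMIT_3_LEVELS (UINT64_C(1) << 42) /* 4 TB: three level table */
#define LIMIT_4_LEVELS (UINT64_C(1) << 53) /* 8 PB: four level table */

/* Hypothetical helper: number of page table levels for a process whose
 * highest used address is `highest`. */
static int page_table_levels(bool is_31bit, uint64_t highest)
{
	if (is_31bit) {
		assert(highest < LIMIT_2_LEVELS);
		return 2;
	}
	assert(highest < LIMIT_4_LEVELS);
	/* A 64 bit process starts with three levels; the fourth is only
	 * added once an address beyond 4 TB is needed. */
	return highest < LIMIT_3_LEVELS ? 3 : 4;
}
```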

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 146e4b3c 09-Feb-2008 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] 1K/2K page table pages.

This patch implements 1K/2K page table pages for s390.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 2f569afd 08-Feb-2008 Martin Schwidefsky <schwidefsky@de.ibm.com>

CONFIG_HIGHPTE vs. sub-page page tables.

Background: I've implemented 1K/2K page tables for s390. These sub-page
page tables are required to properly support the s390 virtualization
instruction with KVM. The SIE instruction requires that the page tables
have 256 page table entries (pte) followed by 256 page status table entries
(pgste). The pgstes are only required if the process is using the SIE
instruction. The pgstes are updated by the hardware and by the hypervisor
for a number of reasons, one of them is dirty and reference bit tracking.
To avoid wasting memory the standard pte table allocation should return
1K/2K (31/64 bit) and 2K/4K if the process is using SIE.
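
The sizes quoted above follow from simple arithmetic, sketched here in C.
The entry sizes of 4 and 8 bytes are inferred from 1K/256 and 2K/256, and
the function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PTES_PER_TABLE 256  /* page table entries per table, per the text */
#define PAGE_BYTES     4096 /* s390 page size */

/* Hypothetical helper: bytes needed for one pte table, optionally
 * followed by the same number of page status table entries (pgstes). */
static size_t pte_table_bytes(bool is_64bit, bool uses_sie)
{
	size_t entry_bytes = is_64bit ? 8 : 4;
	size_t bytes = PTES_PER_TABLE * entry_bytes;

	if (uses_sie)
		bytes *= 2; /* 256 ptes followed by 256 pgstes */

	/* Even the largest variant fits into a single 4K page. */
	assert(bytes <= PAGE_BYTES);
	return bytes;
}
```

Under these assumptions a 4K page holds between one and four page tables,
which is why sub-page allocation saves memory for processes that do not
use SIE.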

Problem: Page size on s390 is 4K, page table size is 1K or 2K. That means
the s390 version for pte_alloc_one cannot return a pointer to a struct
page. Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one
cannot return a pointer to a pte either, since that would require more than
32 bit for the return value of pte_alloc_one (and the pte * would not be
accessible since it's not kmapped).

Solution: The only solution I found to this dilemma is a new typedef: a
pgtable_t. For s390 pgtable_t will be a (pte *) - to be introduced with a
later patch. For everybody else it will be a (struct page *). The
additional problem with the initialization of the ptl lock and the
NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
a destructor pgtable_page_dtor. The page table allocation and free
functions need to call these two whenever a page table page is allocated or
freed. pmd_populate will get a pgtable_t instead of a struct page pointer.
To get the pgtable_t back from a pmd entry that has been installed with
pmd_populate a new function pmd_pgtable is added. It replaces the pmd_page
call in free_pte_range and apply_to_pte_range.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3610cce8 21-Oct-2007 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] Cleanup page table definitions.

- De-confuse the defines for the address-space-control-elements
and the segment/region table entries.
- Create out of line functions for page table allocation / freeing.
- Simplify get_shadow_xxx functions.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>