History log of /linux-master/arch/powerpc/kvm/book3s_64_mmu_radix.c
Revision Date Author Comments
# bd18b688 04-Mar-2024 Peter Xu <peterx@redhat.com>

mm/powerpc: replace pXd_is_leaf() with pXd_leaf()

They're the same macros underneath. Drop pXd_is_leaf(), instead always use
pXd_leaf().

At the meantime, instead of renames, drop the pXd_is_leaf() fallback
definitions directly in arch/powerpc/include/asm/pgtable.h. because
similar fallback macros for pXd_leaf() are already defined in
include/linux/pgtable.h.

Link: https://lkml.kernel.org/r/20240305043750.93762-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 4bc8ff6f 01-Dec-2023 Jordan Niethe <jniethe5@gmail.com>

KVM: PPC: Book3S HV nestedv2: Do not call H_COPY_TOFROM_GUEST

H_COPY_TOFROM_GUEST is part of the nestedv1 API and so should not be
called by a nestedv2 host. Do not attempt to call it.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231201132618.555031-10-vaibhav@linux.ibm.com


# e678748a 01-Dec-2023 Jordan Niethe <jniethe5@gmail.com>

KVM: PPC: Book3S HV nestedv2: Get the PID only if needed to copy tofrom a guest

kvmhv_copy_tofrom_guest_radix() gets the PID at the start of the
function. If pid is not used, then this is a wasteful H_GUEST_GET_STATE
hcall for nestedv2 hosts. Move the assignment to where pid will be used.

Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231201132618.555031-5-vaibhav@linux.ibm.com


# dfcaacc8 13-Sep-2023 Jordan Niethe <jniethe5@gmail.com>

KVM: PPC: Book3s HV: Hold LPIDs in an unsigned long

The LPID register is 32 bits long. The host keeps the lpids for each
guest in an unsigned word struct kvm_arch. Currently, LPIDs are already
limited by mmu_lpid_bits and KVM_MAX_NESTED_GUESTS_SHIFT.

The nestedv2 API returns a 64 bit "Guest ID" to be used be the L1 host
for each L2 guest. This value is used as an lpid, e.g. it is the
parameter used by H_RPT_INVALIDATE. To minimize needless special casing
it makes sense to keep this "Guest ID" in struct kvm_arch::lpid.

This means that struct kvm_arch::lpid is too small so prepare for this
and make it an unsigned long. This is not a problem for the KVM-HV and
nestedv1 cases as their lpid values are already limited to valid ranges
so in those contexts the lpid can be used as an unsigned word safely as
needed.

In the PAPR, the H_RPT_INVALIDATE pid/lpid parameter is already
specified as an unsigned long so change pseries_rpt_invalidate() to
match that. Update the callers of pseries_rpt_invalidate() to also take
an unsigned long if they take an lpid value.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230914030600.16993-10-jniethe5@gmail.com


# ebc88ea7 13-Sep-2023 Jordan Niethe <jniethe5@gmail.com>

KVM: PPC: Book3S HV: Use accessors for VCPU registers

Introduce accessor generator macros for Book3S HV VCPU registers. Use
the accessor functions to replace direct accesses to this registers.

This will be important later for Nested APIv2 support which requires
additional functionality for accessing and modifying VCPU state.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230914030600.16993-7-jniethe5@gmail.com


# 7028ac8d 13-Sep-2023 Jordan Niethe <jniethe5@gmail.com>

KVM: PPC: Use accessors for VCPU registers

Introduce accessor generator macros for VCPU registers. Use the accessor
functions to replace direct accesses to this registers.

This will be important later for Nested APIv2 support which requires
additional functionality for accessing and modifying VCPU state.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230914030600.16993-5-jniethe5@gmail.com


# d00ae31f 08-Jun-2023 Hugh Dickins <hughd@google.com>

powerpc: kvmppc_unmap_free_pmd() pte_offset_kernel()

kvmppc_unmap_free_pmd() use pte_offset_kernel(), like everywhere else
in book3s_64_mmu_radix.c: instead of pte_offset_map(), which will come
to need a pte_unmap() to balance it.

But note that this is a more complex case than most: see those -EAGAINs
in kvmppc_create_pte(), which is coping with kvmppc races beween page
table and huge entry, of the kind which we are expecting to address
in pte_offset_map() - this might want to be revisited in future.

Link: https://lkml.kernel.org/r/c76aa421-aec3-4cc8-cc61-4130f2e27e1@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6cd5c1db 30-Mar-2023 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV: Set SRR1[PREFIX] bit on injected interrupts

Pass the hypervisor (H)SRR1[PREFIX] indication through to synchronous
interrupts injected into the guest.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230330103224.3589928-3-npiggin@gmail.com


# 460ba21d 30-Mar-2023 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Permit SRR1 flags in more injected interrupt types

The prefix architecture in ISA v3.1 introduces a prefixed bit in SRR1
for many types of synchronous interrupts which is set when the interrupt
is caused by a prefixed instruction.

This requires KVM to be able to set this bit when injecting interrupts
into a guest. Plumb through the SRR1 "flags" argument to the core_queue
APIs where it's missing for this. For now they are set to 0, which is
no change.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fixup kvmppc_core_queue_alignment() in booke.c]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230330103224.3589928-2-npiggin@gmail.com


# c8b88b33 11-Oct-2022 Peter Xu <peterx@redhat.com>

kvm: Add interruptible flag to __gfn_to_pfn_memslot()

Add a new "interruptible" flag showing that the caller is willing to be
interrupted by signals during the __gfn_to_pfn_memslot() request. Wire it
up with a FOLL_INTERRUPTIBLE flag that we've just introduced.

This prepares KVM to be able to respond to SIGUSR1 (for QEMU that's the
SIGIPI) even during e.g. handling an userfaultfd page fault.

No functional change intended.

Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221011195809.557016-4-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 20ec3ebd 16-Aug-2022 Chao Peng <chao.p.peng@linux.intel.com>

KVM: Rename mmu_notifier_* to mmu_invalidate_*

The motivation of this renaming is to make these variables and related
helper functions less mmu_notifier bound and can also be used for non
mmu_notifier based page invalidation. mmu_invalidate_* was chosen to
better describe the purpose of 'invalidating' a page that those
variables are used for.

- mmu_notifier_seq/range_start/range_end are renamed to
mmu_invalidate_seq/range_start/range_end.

- mmu_notifier_retry{_hva} helper functions are renamed to
mmu_invalidate_retry{_hva}.

- mmu_notifier_count is renamed to mmu_invalidate_in_progress to
avoid confusion with mn_active_invalidate_count.

- While here, also update kvm_inc/dec_notifier_count() to
kvm_mmu_invalidate_begin/end() to match the change for
mmu_notifier_count.

No functional change intended.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 46d60bdb 06-May-2022 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc: Include asm/firmware.h in all users of firmware_has_feature()

Trying to remove asm/ppc_asm.h from all places that don't need it
leads to several failures linked to firmware_has_feature().

To fix it, include asm/firmware.h in all files using
firmware_has_feature()

All users found with:

git grep -L "firmware\.h" ` git grep -l "firmware_has_feature("`

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/11956ec181a034b51a881ac9c059eea72c679a73.1651828453.git.christophe.leroy@csgroup.eu


# 2031f287 14-Apr-2022 Sean Christopherson <seanjc@google.com>

KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abused

Add wrappers to acquire/release KVM's SRCU lock when stashing the index
in vcpu->src_idx, along with rudimentary detection of illegal usage,
e.g. re-acquiring SRCU and thus overwriting vcpu->src_idx. Because the
SRCU index is (currently) either 0 or 1, illegal nesting bugs can go
unnoticed for quite some time and only cause problems when the nested
lock happens to get a different index.

Wrap the WARNs in PROVE_RCU=y, and make them ONCE, otherwise KVM will
likely yell so loudly that it will bring the kernel to its knees.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Fabiano Rosas <farosas@linux.ibm.com>
Message-Id: <20220415004343.2203171-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# faf01aef 10-Jan-2022 Alexey Kardashevskiy <aik@ozlabs.ru>

KVM: PPC: Merge powerpc's debugfs entry content into generic entry

At the moment KVM on PPC creates 4 types of entries under the kvm debugfs:
1) "%pid-%fd" per a KVM instance (for all platforms);
2) "vm%pid" (for PPC Book3s HV KVM);
3) "vm%u_vcpu%u_timing" (for PPC Book3e KVM);
4) "kvm-xive-%p" (for XIVE PPC Book3s KVM, the same for XICS);

The problem with this is that multiple VMs per process is not allowed for
2) and 3) which makes it possible for userspace to trigger errors when
creating duplicated debugfs entries.

This merges all these into 1).

This defines kvm_arch_create_kvm_debugfs() similar to
kvm_arch_create_vcpu_debugfs().

This defines 2 hooks in kvmppc_ops that allow specific KVM implementations
add necessary entries, this adds the _e500 suffix to
kvmppc_create_vcpu_debugfs_e500() to make it clear what platform it is for.

This makes use of already existing kvm_arch_create_vcpu_debugfs() on PPC.

This removes no more used debugfs_dir pointers from PPC kvm_arch structs.

This stops removing vcpu entries as once created vcpus stay around
for the entire life of a VM and removed when the KVM instance is closed,
see commit d56f5136b010 ("KVM: let kvm_destroy_vm_debugfs clean up vCPU
debugfs directories").

Suggested-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220111005404.162219-1-aik@ozlabs.ru


# cf3b16cf 23-Nov-2021 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV P9: Comment and fix MMU context switching code

Tighten up partition switching code synchronisation and comments.

In particular, hwsync ; isync is required after the last access that is
performed in the context of a partition, before the partition is
switched away from.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211123095231.1036501-40-npiggin@gmail.com


# 0eb596f1 05-Aug-2021 Fabiano Rosas <farosas@linux.ibm.com>

KVM: PPC: Book3S HV: Stop exporting symbols from book3s_64_mmu_radix

The book3s_64_mmu_radix.o object is not part of the KVM builtins and
all the callers of the exported symbols are in the same kvm-hv.ko
module so we should not need to export any symbols.

Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210805212616.2641017-4-farosas@linux.ibm.com


# c232461c 05-Aug-2021 Fabiano Rosas <farosas@linux.ibm.com>

KVM: PPC: Book3S HV: Add sanity check to copy_tofrom_guest

Both paths into __kvmhv_copy_tofrom_guest_radix ensure that we arrive
with an effective address that is smaller than our total addressable
space and addresses quadrant 0.

- The H_COPY_TOFROM_GUEST hypercall path rejects the call with
H_PARAMETER if the effective address has any of the twelve most
significant bits set.

- The kvmhv_copy_tofrom_guest_radix path clears the top twelve bits
before calling the internal function.

Although the callers make sure that the effective address is sane, any
future use of the function is exposed to a programming error, so add a
sanity check.

Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210805212616.2641017-3-farosas@linux.ibm.com


# 5d7d6dac 05-Aug-2021 Fabiano Rosas <farosas@linux.ibm.com>

KVM: PPC: Book3S HV: Fix copy_tofrom_guest routines

The __kvmhv_copy_tofrom_guest_radix function was introduced along with
nested HV guest support. It uses the platform's Radix MMU quadrants to
provide a nested hypervisor with fast access to its nested guests
memory (H_COPY_TOFROM_GUEST hypercall). It has also since been added
as a fast path for the kvmppc_ld/st routines which are used during
instruction emulation.

The commit def0bfdbd603 ("powerpc: use probe_user_read() and
probe_user_write()") changed the low level copy function from
raw_copy_from_user to probe_user_read, which adds a check to
access_ok. In powerpc that is:

static inline bool __access_ok(unsigned long addr, unsigned long size)
{
return addr < TASK_SIZE_MAX && size <= TASK_SIZE_MAX - addr;
}

and TASK_SIZE_MAX is 0x0010000000000000UL for 64-bit, which means that
setting the two MSBs of the effective address (which correspond to the
quadrant) now cause access_ok to reject the access.

This was not caught earlier because the most common code path via
kvmppc_ld/st contains a fallback (kvm_read_guest) that is likely to
succeed for L1 guests. For nested guests there is no fallback.

Another issue is that probe_user_read (now __copy_from_user_nofault)
does not return the number of bytes not copied in case of failure, so
the destination memory is not being cleared anymore in
kvmhv_copy_from_guest_radix:

ret = kvmhv_copy_tofrom_guest_radix(vcpu, eaddr, to, NULL, n);
if (ret > 0) <-- always false!
memset(to + (n - ret), 0, ret);

This patch fixes both issues by skipping access_ok and open-coding the
low level __copy_to/from_user_inatomic.

Fixes: def0bfdbd603 ("powerpc: use probe_user_read() and probe_user_write()")
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210805212616.2641017-2-farosas@linux.ibm.com


# 81468083 21-Jun-2021 Bharata B Rao <bharata@linux.ibm.com>

KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM

In the nested KVM case, replace H_TLB_INVALIDATE by the new hcall
H_RPT_INVALIDATE if available. The availability of this hcall
is determined from "hcall-rpt-invalidate" string in ibm,hypertas-functions
DT property.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210621085003.904767-7-bharata@linux.ibm.com


# 34114136 05-May-2021 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV: Fix conversion to gfn-based MMU notifier callbacks

Commit b1c5356e873c ("KVM: PPC: Convert to the gfn-based MMU notifier
callbacks") causes unmap_gfn_range and age_gfn callbacks to only work
on the first gfn in the range. It also makes the aging callbacks call
into both radix and hash aging functions for radix guests. Fix this.

Add warnings for the single-gfn calls that have been converted to range
callbacks, in case they ever receieve ranges greater than 1.

Fixes: b1c5356e873c ("KVM: PPC: Convert to the gfn-based MMU notifier callbacks")
Reported-by: Bharata B Rao <bharata@linux.ibm.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20210505121509.1470207-1-npiggin@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 32b48bf8 05-May-2021 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV: Fix conversion to gfn-based MMU notifier callbacks

Commit b1c5356e873c ("KVM: PPC: Convert to the gfn-based MMU notifier
callbacks") causes unmap_gfn_range and age_gfn callbacks to only work
on the first gfn in the range. It also makes the aging callbacks call
into both radix and hash aging functions for radix guests. Fix this.

Add warnings for the single-gfn calls that have been converted to range
callbacks, in case they ever receieve ranges greater than 1.

Fixes: b1c5356e873c ("KVM: PPC: Convert to the gfn-based MMU notifier callbacks")
Reported-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210505121509.1470207-1-npiggin@gmail.com


# b1c5356e 01-Apr-2021 Sean Christopherson <seanjc@google.com>

KVM: PPC: Convert to the gfn-based MMU notifier callbacks

Move PPC to the gfn-base MMU notifier APIs, and update all 15 bajillion
PPC-internal hooks to work with gfns instead of hvas.

No meaningful functional change intended, though the exact order of
operations is slightly different since the memslot lookups occur before
calling into arch code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 4a42d848 21-Feb-2021 David Stevens <stevensd@chromium.org>

KVM: x86/mmu: Consider the hva in mmu_notifier retry

Track the range being invalidated by mmu_notifier and skip page fault
retries if the fault address is not affected by the in-progress
invalidation. Handle concurrent invalidations by finding the minimal
range which includes all ranges being invalidated. Although the combined
range may include unrelated addresses and cannot be shrunk as individual
invalidation operations complete, it is unlikely the marginal gains of
proper range tracking are worth the additional complexity.

The primary benefit of this change is the reduction in the likelihood of
extreme latency when handing a page fault due to another thread having
been preempted while modifying host virtual addresses.

Signed-off-by: David Stevens <stevensd@chromium.org>
Message-Id: <20210222024522.1751719-3-stevensd@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# cf59eb13 21-Sep-2020 Wang Wensheng <wangwensheng4@huawei.com>

KVM: PPC: Book3S: Fix symbol undeclared warnings

Build the kernel with `C=2`:
arch/powerpc/kvm/book3s_hv_nested.c:572:25: warning: symbol
'kvmhv_alloc_nested' was not declared. Should it be static?
arch/powerpc/kvm/book3s_64_mmu_radix.c:350:6: warning: symbol
'kvmppc_radix_set_pte_at' was not declared. Should it be static?
arch/powerpc/kvm/book3s_hv.c:3568:5: warning: symbol
'kvmhv_p9_guest_entry' was not declared. Should it be static?
arch/powerpc/kvm/book3s_hv_rm_xics.c:767:15: warning: symbol 'eoi_rc'
was not declared. Should it be static?
arch/powerpc/kvm/book3s_64_vio_hv.c:240:13: warning: symbol
'iommu_tce_kill_rm' was not declared. Should it be static?
arch/powerpc/kvm/book3s_64_vio.c:492:6: warning: symbol
'kvmppc_tce_iommu_do_map' was not declared. Should it be static?
arch/powerpc/kvm/book3s_pr.c:572:6: warning: symbol 'kvmppc_set_pvr_pr'
was not declared. Should it be static?

Those symbols are used only in the files that define them so make them
static to fix the warnings.

Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 1508c22f 08-Jun-2020 Alexey Kardashevskiy <aik@ozlabs.ru>

KVM: PPC: Protect kvm_vcpu_read_guest with srcu locks

The kvm_vcpu_read_guest/kvm_vcpu_write_guest used for nested guests
eventually call srcu_dereference_check to dereference a memslot and
lockdep produces a warning as neither kvm->slots_lock nor
kvm->srcu lock is held and kvm->users_count is above zero (>100 in fact).

This wraps mentioned VCPU read/write helpers in srcu read lock/unlock as
it is done in other places. This uses vcpu->srcu_idx when possible.

These helpers are only used for nested KVM so this may explain why
we did not see these before.

Here is an example of a warning:

=============================
WARNING: suspicious RCU usage
5.7.0-rc3-le_dma-bypass.3.2_a+fstn1 #897 Not tainted
-----------------------------
include/linux/kvm_host.h:633 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by qemu-system-ppc/2752:
#0: c000200359016be0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x144/0xd80 [kvm]

stack backtrace:
CPU: 80 PID: 2752 Comm: qemu-system-ppc Not tainted 5.7.0-rc3-le_dma-bypass.3.2_a+fstn1 #897
Call Trace:
[c0002003591ab240] [c000000000b23ab4] dump_stack+0x190/0x25c (unreliable)
[c0002003591ab2b0] [c00000000023f954] lockdep_rcu_suspicious+0x140/0x164
[c0002003591ab330] [c008000004a445f8] kvm_vcpu_gfn_to_memslot+0x4c0/0x510 [kvm]
[c0002003591ab3a0] [c008000004a44c18] kvm_vcpu_read_guest+0xa0/0x180 [kvm]
[c0002003591ab410] [c008000004ff9bd8] kvmhv_enter_nested_guest+0x90/0xb80 [kvm_hv]
[c0002003591ab980] [c008000004fe07bc] kvmppc_pseries_do_hcall+0x7b4/0x1c30 [kvm_hv]
[c0002003591aba10] [c008000004fe5d30] kvmppc_vcpu_run_hv+0x10a8/0x1a30 [kvm_hv]
[c0002003591abae0] [c008000004a5d954] kvmppc_vcpu_run+0x4c/0x70 [kvm]
[c0002003591abb10] [c008000004a56e54] kvm_arch_vcpu_ioctl_run+0x56c/0x7c0 [kvm]
[c0002003591abba0] [c008000004a3ddc4] kvm_vcpu_ioctl+0x4ac/0xd80 [kvm]
[c0002003591abd20] [c0000000006ebb58] ksys_ioctl+0x188/0x210
[c0002003591abd70] [c0000000006ebc28] sys_ioctl+0x48/0xb0
[c0002003591abdb0] [c000000000042764] system_call_exception+0x1d4/0x2e0
[c0002003591abe20] [c00000000000cce8] system_call_common+0xe8/0x214

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 3f649ab7 03-Jun-2020 Kees Cook <keescook@chromium.org>

treewide: Remove uninitialized_var() usage

Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>


# c1ed1754 11-Jun-2020 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/kvm/book3s64: Fix kernel crash with nested kvm & DEBUG_VIRTUAL

With CONFIG_DEBUG_VIRTUAL=y, __pa() checks for addr value and if it's
less than PAGE_OFFSET it leads to a BUG().

#define __pa(x)
({
VIRTUAL_BUG_ON((unsigned long)(x) < PAGE_OFFSET);
(unsigned long)(x) & 0x0fffffffffffffffUL;
})

kernel BUG at arch/powerpc/kvm/book3s_64_mmu_radix.c:43!
cpu 0x70: Vector: 700 (Program Check) at [c0000018a2187360]
pc: c000000000161b30: __kvmhv_copy_tofrom_guest_radix+0x130/0x1f0
lr: c000000000161d5c: kvmhv_copy_from_guest_radix+0x3c/0x80
...
kvmhv_copy_from_guest_radix+0x3c/0x80
kvmhv_load_from_eaddr+0x48/0xc0
kvmppc_ld+0x98/0x1e0
kvmppc_load_last_inst+0x50/0x90
kvmppc_hv_emulate_mmio+0x288/0x2b0
kvmppc_book3s_radix_page_fault+0xd8/0x2b0
kvmppc_book3s_hv_page_fault+0x37c/0x1050
kvmppc_vcpu_run_hv+0xbb8/0x1080
kvmppc_vcpu_run+0x34/0x50
kvm_arch_vcpu_ioctl_run+0x2fc/0x410
kvm_vcpu_ioctl+0x2b4/0x8f0
ksys_ioctl+0xf4/0x150
sys_ioctl+0x28/0x80
system_call_exception+0x104/0x1d0
system_call_common+0xe8/0x214

kvmhv_copy_tofrom_guest_radix() uses a NULL value for to/from to
indicate direction of copy.

Avoid calling __pa() if the value is NULL to avoid the BUG().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Massage change log a bit to mention CONFIG_DEBUG_VIRTUAL]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200611120159.680284-1-aneesh.kumar@linux.ibm.com


# c0ee37e8 17-Jun-2020 Christoph Hellwig <hch@lst.de>

maccess: rename probe_user_{read,write} to copy_{from,to}_user_nofault

Better describe what these functions do.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 65fddcfc 08-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: reorder includes after introduction of linux/pgtable.h

The replacement of <asm/pgrable.h> with <linux/pgtable.h> made the include
of the latter in the middle of asm includes. Fix this up with the aid of
the below script and manual adjustments here and there.

import sys
import re

if len(sys.argv) is not 3:
print "USAGE: %s <file> <header>" % (sys.argv[0])
sys.exit(1)

hdr_to_move="#include <linux/%s>" % sys.argv[2]
moved = False
in_hdrs = False

with open(sys.argv[1], "r") as f:
lines = f.readlines()
for _line in lines:
line = _line.rstrip('
')
if line == hdr_to_move:
continue
if line.startswith("#include <linux/"):
in_hdrs = True
elif not moved and in_hdrs:
moved = True
print hdr_to_move
print line

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-4-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ca5999fd 08-Jun-2020 Mike Rapoport <rppt@kernel.org>

mm: introduce include/linux/pgtable.h

The include/linux/pgtable.h is going to be the home of generic page table
manipulation functions.

Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
make the latter include asm/pgtable.h.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dadbb612 07-Jun-2020 Souptick Joarder <jrdr.linux@gmail.com>

mm/gup.c: convert to use get_user_{page|pages}_fast_only()

API __get_user_pages_fast() renamed to get_user_pages_fast_only() to
align with pin_user_pages_fast_only().

As part of this we will get rid of write parameter. Instead caller will
pass FOLL_WRITE to get_user_pages_fast_only(). This will not change any
existing functionality of the API.

All the callers are changed to pass FOLL_WRITE.

Also introduce get_user_page_fast_only(), and use it in a few places
that hard-code nr_pages to 1.

Updated the documentation of the API.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org> [arch/powerpc/kvm]
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Suchanek <msuchanek@suse.de>
Link: http://lkml.kernel.org/r/1590396812-31277-1-git-send-email-jrdr.linux@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2fb47060 04-Jun-2020 Mike Rapoport <rppt@kernel.org>

powerpc: add support for folded p4d page tables

Implement primitives necessary for the 4th level folding, add walks of p4d
level where appropriate and replace 5level-fixup.h with pgtable-nop4d.h.

[rppt@linux.ibm.com: powerpc/xmon: drop unused pgdir varialble in show_pte() function]
Link: http://lkml.kernel.org/r/20200519181454.GI1059226@linux.ibm.com
[rppt@linux.ibm.com; build fix]
Link: http://lkml.kernel.org/r/20200423141845.GI13521@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr> # 8xx and 83xx
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: James Morse <james.morse@arm.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200414153455.21744-9-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bf8036a4 28-May-2020 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/book3s64/kvm: Fix secondary page table walk warning during migration

This patch fixes the below warning reported during migration:

find_kvm_secondary_pte called with kvm mmu_lock not held
CPU: 23 PID: 5341 Comm: qemu-system-ppc Tainted: G W 5.7.0-rc5-kvm-00211-g9ccf10d6d088 #432
NIP: c008000000fe848c LR: c008000000fe8488 CTR: 0000000000000000
REGS: c000001e19f077e0 TRAP: 0700 Tainted: G W (5.7.0-rc5-kvm-00211-g9ccf10d6d088)
MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 42222422 XER: 20040000
CFAR: c00000000012f5ac IRQMASK: 0
GPR00: c008000000fe8488 c000001e19f07a70 c008000000ffe200 0000000000000039
GPR04: 0000000000000001 c000001ffc8b4900 0000000000018840 0000000000000007
GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000001
GPR12: 0000000000002000 c000001fff6d9400 000000011f884678 00007fff70b70000
GPR16: 00007fff7137cb90 00007fff7dcb4410 0000000000000001 0000000000000000
GPR20: 000000000ffe0000 0000000000000000 0000000000000001 0000000000000000
GPR24: 8000000000000000 0000000000000001 c000001e1f67e600 c000001e1fd82410
GPR28: 0000000000001000 c000001e2e410000 0000000000000fff 0000000000000ffe
NIP [c008000000fe848c] kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
LR [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv]
Call Trace:
[c000001e19f07a70] [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] (unreliable)
[c000001e19f07b50] [c008000000fd42e4] kvm_vm_ioctl_get_dirty_log_hv+0x33c/0x3c0 [kvm_hv]
[c000001e19f07be0] [c008000000eea878] kvm_vm_ioctl_get_dirty_log+0x30/0x50 [kvm]
[c000001e19f07c00] [c008000000edc818] kvm_vm_ioctl+0x2b0/0xc00 [kvm]
[c000001e19f07d50] [c00000000046e148] ksys_ioctl+0xf8/0x150
[c000001e19f07da0] [c00000000046e1c8] sys_ioctl+0x28/0x80
[c000001e19f07dc0] [c00000000003652c] system_call_exception+0x16c/0x240
[c000001e19f07e20] [c00000000000d070] system_call_common+0xf0/0x278
Instruction dump:
7d3a512a 4200ffd0 7ffefb78 4bfffdc4 60000000 3c820000 e8848468 3c620000
e86384a8 38840010 4800673d e8410018 <0fe00000> 4bfffdd4 60000000 60000000

Reported-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200528080456.87797-1-aneesh.kumar@linux.ibm.com


# 11362b1b 27-May-2020 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Close race with page faults around memslot flushes

There is a potential race condition between hypervisor page faults
and flushing a memslot. It is possible for a page fault to read the
memslot before a memslot is updated and then write a PTE to the
partition-scoped page tables after kvmppc_radix_flush_memslot has
completed. (Note that this race has never been explicitly observed.)

To close this race, it is sufficient to increment the MMU sequence
number while the kvm->mmu_lock is held. That will cause
mmu_notifier_retry() to return true, and the page fault will then
return to the guest without inserting a PTE.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 3d89c2ef 27-May-2020 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Remove user-triggerable WARN_ON

Although in general we do not expect valid PTEs to be found in
kvmppc_create_pte when we are inserting a large page mapping, there
is one situation where this can occur. That is when dirty page
logging is turned off for a memslot while the VM is running.
Because the new memslots are installed before the old memslot is
flushed in kvmppc_core_commit_memory_region_hv(), there is a
window where a hypervisor page fault can try to install a 2MB
(or 1GB) page where there are already small page mappings which
were installed while dirty page logging was enabled and which
have not yet been flushed.

Since we have a situation where valid PTEs can legitimately be
found by kvmppc_unmap_free_pte, and which can be triggered by
userspace, just remove the WARN_ON_ONCE, since it is undesirable
to have userspace able to trigger a kernel warning.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 0aca8a55 13-May-2020 Qian Cai <cai@lca.pw>

KVM: PPC: Book3S HV: Ignore kmemleak false positives

kvmppc_pmd_alloc() and kvmppc_pte_alloc() allocate some memory but then
pud_populate() and pmd_populate() will use __pa() to reference the newly
allocated memory.

Since kmemleak is unable to track the physical memory resulting in false
positives, silence those by using kmemleak_ignore().

unreferenced object 0xc000201c382a1000 (size 4096):
comm "qemu-kvm", pid 124828, jiffies 4295733767 (age 341.250s)
hex dump (first 32 bytes):
c0 00 20 09 f4 60 03 87 c0 00 20 10 72 a0 03 87 .. ..`.... .r...
c0 00 20 0e 13 a0 03 87 c0 00 20 1b dc c0 03 87 .. ....... .....
backtrace:
[<000000004cc2790f>] kvmppc_create_pte+0x838/0xd20 [kvm_hv]
kvmppc_pmd_alloc at arch/powerpc/kvm/book3s_64_mmu_radix.c:366
(inlined by) kvmppc_create_pte at arch/powerpc/kvm/book3s_64_mmu_radix.c:590
[<00000000d123c49a>] kvmppc_book3s_instantiate_page+0x2e0/0x8c0 [kvm_hv]
[<00000000bb549087>] kvmppc_book3s_radix_page_fault+0x1b4/0x2b0 [kvm_hv]
[<0000000086dddc0e>] kvmppc_book3s_hv_page_fault+0x214/0x12a0 [kvm_hv]
[<000000005ae9ccc2>] kvmppc_vcpu_run_hv+0xc5c/0x15f0 [kvm_hv]
[<00000000d22162ff>] kvmppc_vcpu_run+0x34/0x48 [kvm]
[<00000000d6953bc4>] kvm_arch_vcpu_ioctl_run+0x314/0x420 [kvm]
[<000000002543dd54>] kvm_vcpu_ioctl+0x33c/0x950 [kvm]
[<0000000048155cd6>] ksys_ioctl+0xd8/0x130
[<0000000041ffeaa7>] sys_ioctl+0x28/0x40
[<000000004afc4310>] system_call_exception+0x114/0x1e0
[<00000000fb70a873>] system_call_common+0xf0/0x278
unreferenced object 0xc0002001f0c03900 (size 256):
comm "qemu-kvm", pid 124830, jiffies 4295735235 (age 326.570s)
hex dump (first 32 bytes):
c0 00 20 10 fa a0 03 87 c0 00 20 10 fa a1 03 87 .. ....... .....
c0 00 20 10 fa a2 03 87 c0 00 20 10 fa a3 03 87 .. ....... .....
backtrace:
[<0000000023f675b8>] kvmppc_create_pte+0x854/0xd20 [kvm_hv]
kvmppc_pte_alloc at arch/powerpc/kvm/book3s_64_mmu_radix.c:356
(inlined by) kvmppc_create_pte at arch/powerpc/kvm/book3s_64_mmu_radix.c:593
[<00000000d123c49a>] kvmppc_book3s_instantiate_page+0x2e0/0x8c0 [kvm_hv]
[<00000000bb549087>] kvmppc_book3s_radix_page_fault+0x1b4/0x2b0 [kvm_hv]
[<0000000086dddc0e>] kvmppc_book3s_hv_page_fault+0x214/0x12a0 [kvm_hv]
[<000000005ae9ccc2>] kvmppc_vcpu_run_hv+0xc5c/0x15f0 [kvm_hv]
[<00000000d22162ff>] kvmppc_vcpu_run+0x34/0x48 [kvm]
[<00000000d6953bc4>] kvm_arch_vcpu_ioctl_run+0x314/0x420 [kvm]
[<000000002543dd54>] kvm_vcpu_ioctl+0x33c/0x950 [kvm]
[<0000000048155cd6>] ksys_ioctl+0xd8/0x130
[<0000000041ffeaa7>] sys_ioctl+0x28/0x40
[<000000004afc4310>] system_call_exception+0x114/0x1e0
[<00000000fb70a873>] system_call_common+0xf0/0x278

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 8c99d345 26-Apr-2020 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>

KVM: PPC: Clean up redundant 'kvm_run' parameters

In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function parameters
retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
patch does a unified cleanup of these remaining redundant parameters.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# bda3deaa 04-May-2020 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/kvm/book3s: use find_kvm_host_pte in kvmppc_book3s_instantiate_page

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-18-aneesh.kumar@linux.ibm.com


# 6cdf3037 04-May-2020 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/kvm/book3s: Use kvm helpers to walk shadow or secondary table

update kvmppc_hv_handle_set_rc to use find_kvm_nested_guest_pte and
find_kvm_secondary_pte

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-12-aneesh.kumar@linux.ibm.com


# 4b99412e 04-May-2020 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.

The locking rules for walking partition scoped table is different from process
scoped table. Hence add a helper for secondary linux page table walk and also
add check whether we are holding the right locks.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-10-aneesh.kumar@linux.ibm.com


# ae49deda 15-Apr-2020 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Handle non-present PTEs in page fault functions

Since cd758a9b57ee "KVM: PPC: Book3S HV: Use __gfn_to_pfn_memslot in HPT
page fault handler", it's been possible in fairly rare circumstances to
load a non-present PTE in kvmppc_book3s_hv_page_fault() when running a
guest on a POWER8 host.

Because that case wasn't checked for, we could misinterpret the non-present
PTE as being a cache-inhibited PTE. That could mismatch with the
corresponding hash PTE, which would cause the function to fail with -EFAULT
a little further down. That would propagate up to the KVM_RUN ioctl()
generally causing the KVM userspace (usually qemu) to fall over.

This addresses the problem by catching that case and returning to the guest
instead.

For completeness, this fixes the radix page fault handler in the same
way. For radix this didn't cause any obvious misbehaviour, because we
ended up putting the non-present PTE into the guest's partition-scoped
page tables, leading immediately to another hypervisor data/instruction
storage interrupt, which would go through the page fault path again
and fix things up.

Fixes: cd758a9b57ee "KVM: PPC: Book3S HV: Use __gfn_to_pfn_memslot in HPT page fault handler"
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1820402
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# afd31356 17-Feb-2020 Michael Ellerman <mpe@ellerman.id.au>

KVM: PPC: Book3S HV: Use RADIX_PTE_INDEX_SIZE in Radix MMU code

In kvmppc_unmap_free_pte() in book3s_64_mmu_radix.c, we use the
non-constant value PTE_INDEX_SIZE to clear a PTE page.

We can instead use the constant RADIX_PTE_INDEX_SIZE, because we know
this code will only be running when the Radix MMU is active.

Note that we already use RADIX_PTE_INDEX_SIZE for the allocation of
kvm_pte_cache.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Leonardo Bras <leonardo@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# c4fd527f 09-Feb-2020 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

powerpc/kvm: no need to check return value of debugfs_create functions

When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.

Because of this cleanup, we get to remove a few fields in struct
kvm_arch that are now unused.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[mpe: Fix build error in kvm/timing.c, adapt kvmppc_remove_cpu_debugfs()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200209105901.1620958-2-gregkh@linuxfoundation.org


# def0bfdb 23-Jan-2020 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc: use probe_user_read() and probe_user_write()

Instead of opencoding, use probe_user_read() to failessly read
a user location and probe_user_write() for writing to user.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/e041f5eedb23f09ab553be8a91c3de2087147320.1579800517.git.christophe.leroy@c-s.fr


# ce477a7a 19-Dec-2019 Sukadev Bhattiprolu <sukadev@linux.ibm.com>

KVM: PPC: Add skip_page_out parameter to uvmem functions

Add 'skip_page_out' parameter to kvmppc_uvmem_drop_pages() so the
callers can specify whetheter or not to skip paging out pages. This
will be needed in a follow-on patch that implements H_SVM_INIT_ABORT
hcall.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# c3262257 24-Nov-2019 Bharata B Rao <bharata@linux.ibm.com>

KVM: PPC: Book3S HV: Handle memory plug/unplug to secure VM

Register the new memslot with UV during plug and unregister
the memslot during unplug. In addition, release all the
device pages during unplug.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 008e359c 24-Nov-2019 Bharata B Rao <bharata@linux.ibm.com>

KVM: PPC: Book3S HV: Radix changes for secure guest

- After the guest becomes secure, when we handle a page fault of a page
belonging to SVM in HV, send that page to UV via UV_PAGE_IN.
- Whenever a page is unmapped on the HV side, inform UV via UV_PAGE_INVAL.
- Ensure all those routines that walk the secondary page tables of
the guest don't do so in case of secure VM. For secure guest, the
active secondary page tables are in secure memory and the secondary
page tables in HV are freed when guest becomes secure.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# d6eacedd 14-May-2019 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/book3s: Use config independent helpers for page table walk

Even when we have HugeTLB and THP disabled, kernel linear map can still be
mapped with hugepages. This is only an issue with radix translation because hash
MMU doesn't map kernel linear range in linux page table and other kernel
map areas are not mapped using hugepage.

Add config independent helpers and put WARN_ON() when we don't expect things
to be mapped via hugepages.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# d2912cb1 04-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500

Based on 2 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 8f1f7b9b 18-Feb-2019 Suraj Jitindar Singh <sjitindarsingh@gmail.com>

KVM: PPC: Book3S HV: Add KVM stat largepages_[2M/1G]

This adds an entry to the kvm_stats_debugfs directory which provides the
number of large (2M or 1G) pages which have been used to setup the guest
mappings, for radix guests.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# f4607722 29-Dec-2018 Michael Ellerman <mpe@ellerman.id.au>

KVM: PPC: Book3S HV: radix: Fix uninitialized var build error

Old GCCs (4.6.3 at least), aren't able to follow the logic in
__kvmhv_copy_tofrom_guest_radix() and warn that old_pid is used
uninitialized:

arch/powerpc/kvm/book3s_64_mmu_radix.c:75:3: error: 'old_pid' may be
used uninitialized in this function

The logic is OK, we only use old_pid if quadrant == 1, and in that
case it has definitely be initialised, eg:

if (quadrant == 1) {
old_pid = mfspr(SPRN_PID);
...
if (quadrant == 1 && pid != old_pid)
mtspr(SPRN_PID, old_pid);

Annotate it to fix the error.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# ae59a7e1 20-Dec-2018 Suraj Jitindar Singh <sjitindarsingh@gmail.com>

KVM: PPC: Book3S HV: Keep rc bits in shadow pgtable in sync with host

The rc bits contained in ptes are used to track whether a page has been
accessed and whether it is dirty. The accessed bit is used to age a page
and the dirty bit to track whether a page is dirty or not.

Now that we support nested guests there are three ptes which track the
state of the same page:
- The partition-scoped page table in the L1 guest, mapping L2->L1 address
- The partition-scoped page table in the host for the L1 guest, mapping
L1->L0 address
- The shadow partition-scoped page table for the nested guest in the host,
mapping L2->L0 address

The idea is to attempt to keep the rc state of these three ptes in sync,
both when setting and when clearing rc bits.

When setting the bits we achieve consistency by:
- Initially setting the bits in the shadow page table as the 'and' of the
other two.
- When updating in software the rc bits in the shadow page table we
ensure the state is consistent with the other two locations first, and
update these before reflecting the change into the shadow page table.
i.e. only set the bits in the L2->L0 pte if also set in both the
L2->L1 and the L1->L0 pte.

When clearing the bits we achieve consistency by:
- The rc bits in the shadow page table are only cleared when discarding
a pte, and we don't need to record this as if either bit is set then
it must also be set in the pte mapping L1->L0.
- When L1 clears an rc bit in the L2->L1 mapping it __should__ issue a
tlbie instruction
- This means we will discard the pte from the shadow page table
meaning the mapping will have to be setup again.
- When setup the pte again in the shadow page table we will ensure
consistency with the L2->L1 pte.
- When the host clears an rc bit in the L1->L0 mapping we need to also
clear the bit in any ptes in the shadow page table which map the same
gfn so we will be notified if a nested guest accesses the page.
This case is what this patch specifically concerns.
- We can search the nest_rmap list for that given gfn and clear the
same bit from all corresponding ptes in shadow page tables.
- If a nested guest causes either of the rc bits to be set by software
in future then we will update the L1->L0 pte and maintain consistency.

With the process outlined above we aim to maintain consistency of the 3
pte locations where we track rc for a given guest page.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 90165d3d 20-Dec-2018 Suraj Jitindar Singh <sjitindarsingh@gmail.com>

KVM: PPC: Book3S HV: Introduce kvmhv_update_nest_rmap_rc_list()

Introduce a function kvmhv_update_nest_rmap_rc_list() which for a given
nest_rmap list will traverse it, find the corresponding pte in the shadow
page tables, and if it still maps the same host page update the rc bits
accordingly.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 95d386c2 13-Dec-2018 Suraj Jitindar Singh <sjitindarsingh@gmail.com>

KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L3 guest

Previously when a device was being emulated by an L1 guest for an L2
guest, that device couldn't then be passed through to an L3 guest. This
was because the L1 guest had no method for accessing L3 memory.

The hcall H_COPY_TOFROM_GUEST provides this access. Thus this setup for
passthrough can now be allowed.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 6ff887b8 13-Dec-2018 Suraj Jitindar Singh <sjitindarsingh@gmail.com>

KVM: PPC: Book3S: Introduce new hcall H_COPY_TOFROM_GUEST to access quadrants 1 & 2

A guest cannot access quadrants 1 or 2 as this would result in an
exception. Thus introduce the hcall H_COPY_TOFROM_GUEST to be used by a
guest when it wants to perform an access to quadrants 1 or 2, for
example when it wants to access memory for one of its nested guests.

Also provide an implementation for the kvm-hv module.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# d7b45615 13-Dec-2018 Suraj Jitindar Singh <sjitindarsingh@gmail.com>

KVM: PPC: Book3S HV: Implement functions to access quadrants 1 & 2

The POWER9 radix mmu has the concept of quadrants. The quadrant number
is the two high bits of the effective address and determines the fully
qualified address to be used for the translation. The fully qualified
address consists of the effective lpid, the effective pid and the
effective address. This gives then 4 possible quadrants 0, 1, 2, and 3.

When accessing these quadrants the fully qualified address is obtained
as follows:

Quadrant | Hypervisor | Guest
--------------------------------------------------------------------------
| EA[0:1] = 0b00 | EA[0:1] = 0b00
0 | effLPID = 0 | effLPID = LPIDR
| effPID = PIDR | effPID = PIDR
--------------------------------------------------------------------------
| EA[0:1] = 0b01 |
1 | effLPID = LPIDR | Invalid Access
| effPID = PIDR |
--------------------------------------------------------------------------
| EA[0:1] = 0b10 |
2 | effLPID = LPIDR | Invalid Access
| effPID = 0 |
--------------------------------------------------------------------------
| EA[0:1] = 0b11 | EA[0:1] = 0b11
3 | effLPID = 0 | effLPID = LPIDR
| effPID = 0 | effPID = 0
--------------------------------------------------------------------------

In the Guest;
Quadrant 3 is normally used to address the operating system since this
uses effPID=0 and effLPID=LPIDR, meaning the PID register doesn't need to
be switched.
Quadrant 0 is normally used to address user space since the effLPID and
effPID are taken from the corresponding registers.

In the Host;
Quadrant 0 and 3 are used as above, however the effLPID is always 0 to
address the host.

Quadrants 1 and 2 can be used by the host to address guest memory using
a guest effective address. Since the effLPID comes from the LPID register,
the host loads the LPID of the guest it would like to access (and the
PID of the process) and can perform accesses to a guest effective
address.

This means quadrant 1 can be used to address the guest user space and
quadrant 2 can be used to address the guest operating system from the
hypervisor, using a guest effective address.

Access to the quadrants can cause a Hypervisor Data Storage Interrupt
(HDSI) due to being unable to perform partition scoped translation.
Previously this could only be generated from a guest and so the code
path expects us to take the KVM trampoline in the interrupt handler.
This is no longer the case so we modify the handler to call
bad_page_fault() to check if we were expecting this fault so we can
handle it gracefully and just return with an error code. In the hash mmu
case we still raise an unknown exception since quadrants aren't defined
for the hash mmu.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 5af3e9d0 11-Dec-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Flush guest mappings when turning dirty tracking on/off

This adds code to flush the partition-scoped page tables for a radix
guest when dirty tracking is turned on or off for a memslot. Only the
guest real addresses covered by the memslot are flushed. The reason
for this is to get rid of any 2M PTEs in the partition-scoped page
tables that correspond to host transparent huge pages, so that page
dirtiness is tracked at a system page (4k or 64k) granularity rather
than a 2M granularity. The page tables are also flushed when turning
dirty tracking off so that the memslot's address space can be
repopulated with THPs if possible.

To do this, we add a new function kvmppc_radix_flush_memslot(). Since
this does what's needed for kvmppc_core_flush_memslot_hv() on a radix
guest, we now make kvmppc_core_flush_memslot_hv() call the new
kvmppc_radix_flush_memslot() rather than calling kvm_unmap_radix()
for each page in the memslot. This has the effect of fixing a bug in
that kvmppc_core_flush_memslot_hv() was previously calling
kvm_unmap_radix() without holding the kvm->mmu_lock spinlock, which
is required to be held.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# c43c3a86 11-Dec-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Cleanups - constify memslots, fix comments

This adds 'const' to the declarations for the struct kvm_memory_slot
pointer parameters of some functions, which will make it possible to
call those functions from kvmppc_core_commit_memory_region_hv()
in the next patch.

This also fixes some comments about locking.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# f460f679 11-Dec-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Map single pages when doing dirty page logging

For radix guests, this makes KVM map guest memory as individual pages
when dirty page logging is enabled for the memslot corresponding to the
guest real address. Having a separate partition-scoped PTE for each
system page mapped to the guest means that we have a separate dirty
bit for each page, thus making the reported dirty bitmap more accurate.
Without this, if part of guest memory is backed by transparent huge
pages, the dirty status is reported at a 2MB granularity rather than
a 64kB (or 4kB) granularity for that part, causing userspace to have
to transmit more data when migrating the guest.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 83a05510 07-Oct-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Add nested shadow page tables to debugfs

This adds a list of valid shadow PTEs for each nested guest to
the 'radix' file for the guest in debugfs. This can be useful for
debugging.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 690ed4ca 07-Oct-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Use hypercalls for TLB invalidation when nested

This adds code to call the H_TLB_INVALIDATE hypercall when running as
a guest, in the cases where we need to invalidate TLBs (or other MMU
caches) as part of managing the mappings for a nested guest. Calling
H_TLB_INVALIDATE lets the nested hypervisor inform the parent
hypervisor about changes to partition-scoped page tables or the
partition table without needing to do hypervisor-privileged tlbie
instructions.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 8cf531ed 07-Oct-2018 Suraj Jitindar Singh <sjitindarsingh@gmail.com>

KVM: PPC: Book3S HV: Introduce rmap to track nested guest mappings

When a host (L0) page which is mapped into a (L1) guest is in turn
mapped through to a nested (L2) guest we keep a reverse mapping (rmap)
so that these mappings can be retrieved later.

Whenever we create an entry in a shadow_pgtable for a nested guest we
create a corresponding rmap entry and add it to the list for the
L1 guest memslot at the index of the L1 guest page it maps. This means
at the L1 guest memslot we end up with lists of rmaps.

When we are notified of a host page being invalidated which has been
mapped through to a (L1) guest, we can then walk the rmap list for that
guest page, and find and invalidate all of the corresponding
shadow_pgtable entries.

In order to reduce memory consumption, we compress the information for
each rmap entry down to 52 bits -- 12 bits for the LPID and 40 bits
for the guest real page frame number -- which will fit in a single
unsigned long. To avoid a scenario where a guest can trigger
unbounded memory allocations, we scan the list when adding an entry to
see if there is already an entry with the contents we need. This can
occur, because we don't ever remove entries from the middle of a list.

A struct nested guest rmap is a list pointer and an rmap entry;
----------------
| next pointer |
----------------
| rmap entry |
----------------

Thus the rmap pointer for each guest frame number in the memslot can be
either NULL, a single entry, or a pointer to a list of nested rmap entries.

gfn memslot rmap array
-------------------------
0 | NULL | (no rmap entry)
-------------------------
1 | single rmap entry | (rmap entry with low bit set)
-------------------------
2 | list head pointer | (list of rmap entries)
-------------------------

The final entry always has the lowest bit set and is stored in the next
pointer of the last list entry, or as a single rmap entry.
With a list of rmap entries looking like;

----------------- ----------------- -------------------------
| list head ptr | ----> | next pointer | ----> | single rmap entry |
----------------- ----------------- -------------------------
| rmap entry | | rmap entry |
----------------- -------------------------

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# fd10be25 07-Oct-2018 Suraj Jitindar Singh <sjitindarsingh@gmail.com>

KVM: PPC: Book3S HV: Handle page fault for a nested guest

Consider a normal (L1) guest running under the main hypervisor (L0),
and then a nested guest (L2) running under the L1 guest which is acting
as a nested hypervisor. L0 has page tables to map the address space for
L1 providing the translation from L1 real address -> L0 real address;

L1
|
| (L1 -> L0)
|
----> L0

There are also page tables in L1 used to map the address space for L2
providing the translation from L2 real address -> L1 read address. Since
the hardware can only walk a single level of page table, we need to
maintain in L0 a "shadow_pgtable" for L2 which provides the translation
from L2 real address -> L0 real address. Which looks like;

L2 L2
| |
| (L2 -> L1) |
| |
----> L1 | (L2 -> L0)
| |
| (L1 -> L0) |
| |
----> L0 --------> L0

When a page fault occurs while running a nested (L2) guest we need to
insert a pte into this "shadow_pgtable" for the L2 -> L0 mapping. To
do this we need to:

1. Walk the pgtable in L1 memory to find the L2 -> L1 mapping, and
provide a page fault to L1 if this mapping doesn't exist.
2. Use our L1 -> L0 pgtable to convert this L1 address to an L0 address,
or try to insert a pte for that mapping if it doesn't exist.
3. Now we have a L2 -> L0 mapping, insert this into our shadow_pgtable

Once this mapping exists we can take rc faults when hardware is unable
to automatically set the reference and change bits in the pte. On these
we need to:

1. Check the rc bits on the L2 -> L1 pte match, and otherwise reflect
the fault down to L1.
2. Set the rc bits in the L1 -> L0 pte which corresponds to the same
host page.
3. Set the rc bits in the L2 -> L0 pte.

As we reuse a large number of functions in book3s_64_mmu_radix.c for
this we also needed to refactor a number of these functions to take
an lpid parameter so that the correct lpid is used for tlb invalidations.
The functionality however has remained the same.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# f0f825f0 07-Oct-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Use kvmppc_unmap_pte() in kvm_unmap_radix()

kvmppc_unmap_pte() does a sequence of operations that are open-coded in
kvm_unmap_radix(). This extends kvmppc_unmap_pte() a little so that it
can be used by kvm_unmap_radix(), and makes kvm_unmap_radix() call it.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 04bae9d5 07-Oct-2018 Suraj Jitindar Singh <sjitindarsingh@gmail.com>

KVM: PPC: Book3S HV: Refactor radix page fault handler

The radix page fault handler accounts for all cases, including just
needing to insert a pte. This breaks it up into separate functions for
the two main cases; setting rc and inserting a pte.

This allows us to make the setting of rc and inserting of a pte
generic for any pgtable, not specific to the one for this guest.

[paulus@ozlabs.org - reduced diffs from previous code]

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 9811c78e 07-Oct-2018 Suraj Jitindar Singh <sjitindarsingh@gmail.com>

KVM: PPC: Book3S HV: Make kvmppc_mmu_radix_xlate process/partition table agnostic

kvmppc_mmu_radix_xlate() is used to translate an effective address
through the process tables. The process table and partition tables have
identical layout. Exploit this fact to make the kvmppc_mmu_radix_xlate()
function able to translate either an effective address through the
process tables or a guest real address through the partition tables.

[paulus@ozlabs.org - reduced diffs from previous code]

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 9a94d3ee 07-Oct-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Add a debugfs file to dump radix mappings

This adds a file called 'radix' in the debugfs directory for the
guest, which when read gives all of the valid leaf PTEs in the
partition-scoped radix tree for a radix guest, in human-readable
format. It is analogous to the existing 'htab' file which dumps
the HPT entries for a HPT guest.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 6579804c 03-Oct-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Avoid crash from THP collapse during radix page fault

Commit 71d29f43b633 ("KVM: PPC: Book3S HV: Don't use compound_order to
determine host mapping size", 2018-09-11) added a call to
__find_linux_pte() and a dereference of the returned PTE pointer to the
radix page fault path in the common case where the page is normal
system memory. Previously, __find_linux_pte() was only called for
mappings to physical addresses which don't have a page struct (e.g.
memory-mapped I/O) or where the page struct is marked as reserved
memory.

This exposes us to the possibility that the returned PTE pointer
could be NULL, for example in the case of a concurrent THP collapse
operation. Dereferencing the returned NULL pointer causes a host
crash.

To fix this, we check for NULL, and if it is NULL, we retry the
operation by returning to the guest, with the expectation that it
will generate the same page fault again (unless of course it has
been fixed up by another CPU in the meantime).

Fixes: 71d29f43b633 ("KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 71d29f43 11-Sep-2018 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size

THP paths can defer splitting compound pages until after the actual
remap and TLB flushes to split a huge PMD/PUD. This causes radix
partition scope page table mappings to get out of synch with the host
qemu page table mappings.

This results in random memory corruption in the guest when running
with THP. The easiest way to reproduce is use KVM balloon to free up
a lot of memory in the guest and then shrink the balloon to give the
memory back, while some work is being done in the guest.

Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: kvm-ppc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# c066fafc 14-Aug-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Use correct pagesize in kvm_unmap_radix()

Since commit e641a317830b ("KVM: PPC: Book3S HV: Unify dirty page map
between HPT and radix", 2017-10-26), kvm_unmap_radix() computes the
number of PAGE_SIZEd pages being unmapped and passes it to
kvmppc_update_dirty_map(), which expects to be passed the page size
instead. Consequently it will only mark one system page dirty even
when a large page (for example a THP page) is being unmapped. The
consequence of this is that part of the THP page might not get copied
during live migration, resulting in memory corruption for the guest.

This fixes it by computing and passing the page size in kvm_unmap_radix().

Cc: stable@vger.kernel.org # v4.15+
Fixes: e641a317830b (KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 2bf1071a 05-Jul-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc/64s: Remove POWER9 DD1 support

POWER9 DD1 was never a product. It is no longer supported by upstream
firmware, and it is not effectively supported in Linux due to lack of
testing.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
[mpe: Remove arch_make_huge_pte() entirely]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 878cf2bb 17-May-2018 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV: radix: Do not clear partition PTE when RC or write bits do not match

Adding the write bit and RC bits to pte permissions does not require a
pte clear and flush. There should not be other bits changed here,
because restricting access or changing the PFN must have already
invalidated any existing ptes (otherwise the race is already lost).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# bc64dd0e 17-May-2018 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV: radix: Refine IO region partition scope attributes

When the radix fault handler has no page from the process address
space (e.g., for IO memory), it looks up the process pte and sets
partition table pte using that to get attributes like CI and guarded.
If the process table entry is to be writable, set _PAGE_DIRTY as well
to avoid an RC update. If not, then ensure _PAGE_DIRTY does not come
across. Set _PAGE_ACCESSED as well to avoid RC update.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# d91cb39f 17-May-2018 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV: Make radix use the Linux translation flush functions for partition scope

This has the advantage of consolidating TLB flush code in fewer
places, and it also implements powerpc:tlbie trace events.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# a5704e83 17-May-2018 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV: Recursively unmap all page table entries when unmapping

When partition scope mappings are unmapped with kvm_unmap_radix, the
pte is cleared, but the page table structure is left in place. If the
next page fault requests a different page table geometry (e.g., due to
THP promotion or split), kvmppc_create_pte is responsible for changing
the page tables.

When a page table entry is to be converted to a large pte, the page
table entry is cleared, the PWC flushed, then the page table it points
to freed. This will cause pte page tables to leak when a 1GB page is
to replace a pud entry points to a pmd table with pte tables under it:
The pmd table will be freed, but its pte tables will be missed.

Fix this by replacing the simple clear and free code with one that
walks down the page tables and frees children. Care must be taken to
clear the root entry being unmapped then flushing the PWC before
freeing any page tables, as explained in comments.

This requires PWC flush to logically become a flush-all-PWC (which it
already is in hardware, but the KVM API needs to be changed to avoid
confusion).

This code also checks that no unexpected pte entries exist in any page
table being freed, and unmaps those and emits a WARN. This is an
expensive operation for the pte page level, but partition scope
changes are rare, so it's unconditional for now to iron out bugs. It
can be put under a CONFIG option or removed after some time.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# a5fad1e9 17-May-2018 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV: Use a helper to unmap ptes in the radix fault path

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 7e3d9a1d 08-May-2018 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV: Make radix clear pte when unmapping

The current partition table unmap code clears the _PAGE_PRESENT bit
out of the pte, which leaves pud_huge/pmd_huge true and does not
clear pud_present/pmd_present. This can confuse subsequent page
faults and possibly lead to the guest looping doing continual
hypervisor page faults.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# e2560b10 08-May-2018 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV: Make radix use correct tlbie sequence in kvmppc_radix_tlbie_page

The standard eieio ; tlbsync ; ptesync must follow tlbie to ensure it
is ordered with respect to subsequent operations.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 21828c99 16-Apr-2018 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/kvm: Switch kvm pmd allocator to custom allocator

In the next set of patches, we will switch pmd allocator to use page fragments
and the locking will be updated to split pmd ptlock. We want to avoid using
fragments for partition-scoped table. Use slab cache similar to level 4 table

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 31c8b0d0 28-Feb-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Use __gfn_to_pfn_memslot() in page fault handler

This changes the hypervisor page fault handler for radix guests to use
the generic KVM __gfn_to_pfn_memslot() function instead of using
get_user_pages_fast() and then handling the case of VM_PFNMAP vmas
specially. The old code missed the case of VM_IO vmas; with this
change, VM_IO vmas will now be handled correctly by code within
__gfn_to_pfn_memslot.

Currently, __gfn_to_pfn_memslot calls hva_to_pfn, which only uses
__get_user_pages_fast for the initial lookup in the cases where
either atomic or async is set. Since we are not setting either
atomic or async, we do our own __get_user_pages_fast first, for now.

This also adds code to check for the KVM_MEM_READONLY flag on the
memslot. If it is set and this is a write access, we synthesize a
data storage interrupt for the guest.

In the case where the page is not normal RAM (i.e. page == NULL in
kvmppc_book3s_radix_page_fault(), we read the PTE from the Linux page
tables because we need the mapping attribute bits as well as the PFN.
(The mapping attribute bits indicate whether accesses have to be
non-cacheable and/or guarded.)

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# a5d4b589 22-Mar-2018 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/mm: Fixup tlbie vs store ordering issue on POWER9

On POWER9, under some circumstances, a broadcast TLB invalidation
might complete before all previous stores have drained, potentially
allowing stale stores from becoming visible after the invalidation.
This works around it by doubling up those TLB invalidations which was
verified by HW to be sufficient to close the risk window.

This will be documented in a yet-to-be-published errata.

Fixes: 1a472c9dba6b ("powerpc/mm/radix: Add tlbflush routines")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Enable the feature in the DT CPU features code for all Power9,
rename the feature to CPU_FTR_P9_TLBIE_BUG per benh.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 58c5c276 24-Feb-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Handle 1GB pages in radix page fault handler

This adds code to the radix hypervisor page fault handler to handle the
case where the guest memory is backed by 1GB hugepages, and put them
into the partition-scoped radix tree at the PUD level. The code is
essentially analogous to the code for 2MB pages. This also rearranges
kvmppc_create_pte() to make it easier to follow.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# f7caf712 24-Feb-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Streamline setting of reference and change bits

When using the radix MMU, we can get hypervisor page fault interrupts
with the DSISR_SET_RC bit set in DSISR/HSRR1, indicating that an
attempt to set the R (reference) or C (change) bit in a PTE atomically
failed. Previously we would find the corresponding Linux PTE and
check the permission and dirty bits there, but this is not really
necessary since we only need to do what the hardware was trying to
do, namely set R or C atomically. This removes the code that reads
the Linux PTE and just update the partition-scoped PTE, having first
checked that it is still present, and if the access is a write, that
the PTE still has write permission.

Furthermore, we now check whether any other relevant bits are set
in DSISR, and if there are, then we proceed with the rest of the
function in order to handle whatever condition they represent,
instead of returning to the guest as we did previously.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# c4c8a764 23-Feb-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Radix page fault handler optimizations

This improves the handling of transparent huge pages in the radix
hypervisor page fault handler. Previously, if a small page is faulted
in to a 2MB region of guest physical space, that means that there is
a page table pointer at the PMD level, which could never be replaced
by a leaf (2MB) PMD entry. This adds the code to clear the PMD,
invlidate the page walk cache and free the page table page in this
situation, so that the leaf PMD entry can be created.

This also adds code to check whether a PMD or PTE being inserted is
the same as is already there (because of a race with another CPU that
faulted on the same page) and if so, we don't replace the existing
entry, meaning that we don't invalidate the PTE or PMD and do a TLB
invalidation.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# c3856aeb 23-Feb-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Fix handling of large pages in radix page fault handler

This fixes several bugs in the radix page fault handler relating to
the way large pages in the memory backing the guest were handled.
First, the check for large pages only checked for explicit huge pages
and missed transparent huge pages. Then the check that the addresses
(host virtual vs. guest physical) had appropriate alignment was
wrong, meaning that the code never put a large page in the partition
scoped radix tree; it was always demoted to a small page.

Fixing this exposed bugs in kvmppc_create_pte(). We were never
invalidating a 2MB PTE, which meant that if a page was initially
faulted in without write permission and the guest then attempted
to store to it, we would never update the PTE to have write permission.
If we find a valid 2MB PTE in the PMD, we need to clear it and
do a TLB invalidation before installing either the new 2MB PTE or
a pointer to a page table page.

This also corrects an assumption that get_user_pages_fast would set
the _PAGE_DIRTY bit if we are writing, which is not true. Instead we
mark the page dirty explicitly with set_page_dirty_lock(). This
also means we don't need the dirty bit set on the host PTE when
providing write access on a read fault.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 117647ff 09-Nov-2017 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Fix typo in kvmppc_hv_get_dirty_log_radix()

This fixes a typo where the intent was to assign to 'j' in order to
skip some number of bits in the dirty bitmap for a guest. The effect
of the typo is benign since it means we just iterate through all the
bits rather than skipping bits which we know will be zero. This issue
was found by Coverity.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 18c3640c 13-Sep-2017 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host

This sets up the machinery for switching a guest between HPT (hashed
page table) and radix MMU modes, so that in future we can run a HPT
guest on a radix host on POWER9 machines.

* The KVM_PPC_CONFIGURE_V3_MMU ioctl can now specify either HPT or
radix mode, on a radix host.

* The KVM_CAP_PPC_MMU_HASH_V3 capability now returns 1 on POWER9
with HV KVM on a radix host.

* The KVM_PPC_GET_SMMU_INFO returns information about the HPT MMU on a
radix host.

* The KVM_PPC_ALLOCATE_HTAB ioctl on a radix host will switch the
guest to HPT mode and allocate a HPT.

* For simplicity, we now allocate the rmap array for each memslot,
even on a radix host, since it will be needed if the guest switches
to HPT mode.

* Since we cannot yet run a HPT guest on a radix host, the KVM_RUN
ioctl will return an EINVAL error in that case.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# e641a317 25-Oct-2017 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix

Currently, the HPT code in HV KVM maintains a dirty bit per guest page
in the rmap array, whether or not dirty page tracking has been enabled
for the memory slot. In contrast, the radix code maintains a dirty
bit per guest page in memslot->dirty_bitmap, and only does so when
dirty page tracking has been enabled.

This changes the HPT code to maintain the dirty bits in the memslot
dirty_bitmap like radix does. This results in slightly less code
overall, and will mean that we do not lose the dirty bits when
transitioning between HPT and radix mode in future.

There is one minor change to behaviour as a result. With HPT, when
dirty tracking was enabled for a memslot, we would previously clear
all the dirty bits at that point (both in the HPT entries and in the
rmap arrays), meaning that a KVM_GET_DIRTY_LOG ioctl immediately
following would show no pages as dirty (assuming no vcpus have run
in the meantime). With this change, the dirty bits on HPT entries
are not cleared at the point where dirty tracking is enabled, so
KVM_GET_DIRTY_LOG would show as dirty any guest pages that are
resident in the HPT and dirty. This is consistent with what happens
on radix.

This also fixes a bug in the mark_pages_dirty() function for radix
(in the sense that the function no longer exists). In the case where
a large page of 64 normal pages or more is marked dirty, the
addressing of the dirty bitmap was incorrect and could write past
the end of the bitmap. Fortunately this case was never hit in
practice because a 2MB large page is only 32 x 64kB pages, and we
don't support backing the guest with 1GB huge pages at this point.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 94171b19 27-Jul-2017 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/mm: Rename find_linux_pte_or_hugepte()

Add newer helpers to make the function usage simpler. It is always
recommended to use find_current_mm_pte() for walking the page table.
If we cannot use find_current_mm_pte(), it should be documented why
the said usage of __find_linux_pte() is safe against a parallel THP
split.

For now we have KVM code using __find_linux_pte(). This is because kvm
code ends up calling __find_linux_pte() in real mode with MSR_EE=0 but
with PACA soft_enabled = 1. We may want to fix that later and make
sure we keep the MSR_EE and PACA soft_enabled in sync. When we do that
we can switch kvm to use find_linux_pte().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 870cfe77 18-Jul-2017 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/mm: Update definitions of DSISR bits

This updates the definitions for the various DSISR bits to
match both some historical stuff and to match new bits on
POWER9.

In addition, we define some masks corresponding to the "bad"
faults on Book3S, and some masks corresponding to the bits
that match between DSISR and SRR1 for a DSI and an ISI.

This comes with a small code update to change the definition
of DSISR_PGDIRFAULT which becomes DSISR_PRTABLE_FAULT to
match architecture 3.0B

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 70cd4c10 26-Feb-2017 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Fix software walk of guest process page tables

This fixes some bugs in the code that walks the guest's page tables.
These bugs cause MMIO emulation to fail whenever the guest is in
virtial mode (MMU on), leading to the guest hanging if it tried to
access a virtio device.

The first bug was that when reading the guest's process table, we were
using the whole of arch->process_table, not just the field that contains
the process table base address. The second bug was that the mask used
when reading the process table entry to get the radix tree base address,
RPDB_MASK, had the wrong value.

Fixes: 9e04ba69beec ("KVM: PPC: Book3S HV: Add basic infrastructure for radix guests")
Fixes: e99833448c5f ("powerpc/mm/radix: Add partition table format & callback")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 8cf4ecc0 30-Jan-2017 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Enable radix guest support

This adds a few last pieces of the support for radix guests:

* Implement the backends for the KVM_PPC_CONFIGURE_V3_MMU and
KVM_PPC_GET_RMMU_INFO ioctls for radix guests

* On POWER9, allow secondary threads to be on/off-lined while guests
are running.

* Set up LPCR and the partition table entry for radix guests.

* Don't allocate the rmap array in the kvm_memory_slot structure
on radix.

* Don't try to initialize the HPT for radix guests, since they don't
have an HPT.

* Take out the code that prevents the HV KVM module from
initializing on radix hosts.

At this stage, we only support radix guests if the host is running
in radix mode, and only support HPT guests if the host is running in
HPT mode. Thus a guest cannot switch from one mode to the other,
which enables some simplifications.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 8f7b79b8 30-Jan-2017 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Implement dirty page logging for radix guests

This adds code to keep track of dirty pages when requested (that is,
when memslot->dirty_bitmap is non-NULL) for radix guests. We use the
dirty bits in the PTEs in the second-level (partition-scoped) page
tables, together with a bitmap of pages that were dirty when their
PTE was invalidated (e.g., when the page was paged out). This bitmap
is stored in the first half of the memslot->dirty_bitmap area, and
kvm_vm_ioctl_get_dirty_log_hv() now uses the second half for the
bitmap that gets returned to userspace.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 01756099 30-Jan-2017 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: MMU notifier callbacks for radix guests

This adapts our implementations of the MMU notifier callbacks
(unmap_hva, unmap_hva_range, age_hva, test_age_hva, set_spte_hva)
to call radix functions when the guest is using radix. These
implementations are much simpler than for HPT guests because we
have only one PTE to deal with, so we don't need to traverse
rmap chains.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 5a319350 30-Jan-2017 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Page table construction and page faults for radix guests

This adds the code to construct the second-level ("partition-scoped" in
architecturese) page tables for guests using the radix MMU. Apart from
the PGD level, which is allocated when the guest is created, the rest
of the tree is all constructed in response to hypervisor page faults.

As well as hypervisor page faults for missing pages, we also get faults
for reference/change (RC) bits needing to be set, as well as various
other error conditions. For now, we only set the R or C bit in the
guest page table if the same bit is set in the host PTE for the
backing page.

This code can take advantage of the guest being backed with either
transparent or ordinary 2MB huge pages, and insert 2MB page entries
into the guest page tables. There is no support for 1GB huge pages
yet.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 9e04ba69 30-Jan-2017 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Add basic infrastructure for radix guests

This adds a field in struct kvm_arch and an inline helper to
indicate whether a guest is a radix guest or not, plus a new file
to contain the radix MMU code, which currently contains just a
translate function which knows how to traverse the guest page
tables to translate an address.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>