History log of /linux-master/arch/powerpc/include/asm/kvm_ppc.h
Revision Date Author Comments
# b7bce570 10-Oct-2023 Benjamin Gray <bgray@linux.ibm.com>

powerpc/kvm: Force cast endianness of KVM shared regs

Sparse reports endianness mismatches in the KVM shared regs getter and
setter helpers.

This code has dynamic endianness behind a safe interface, so a force is
warranted here to tell sparse this is OK.

Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

Link: https://msgid.link/20231011053711.93427-8-bgray@linux.ibm.com


# 19d31c5f 13-Sep-2023 Jordan Niethe <jniethe5@gmail.com>

KVM: PPC: Add support for nestedv2 guests

A series of hcalls have been added to the PAPR which allow a regular
guest partition to create and manage guest partitions of its own. KVM
already had an interface that allowed this on powernv platforms. This
existing interface will now be called "nestedv1". The newly added PAPR
interface will be called "nestedv2". PHYP will support the nestedv2
interface. At this time the host side of the nestedv2 interface has not
been implemented on powernv but there is no technical reason why it
could not be added.

The nestedv1 interface is still supported.

Add support to KVM to utilize these hcalls to enable running nested
guests as a pseries guest on PHYP.

Overview of the new hcall usage:

- L1 and L0 negotiate capabilities with
H_GUEST_{G,S}ET_CAPABILITIES()

- L1 requests the L0 create a L2 with
H_GUEST_CREATE() and receives a handle to use in future hcalls

- L1 requests the L0 create a L2 vCPU with
H_GUEST_CREATE_VCPU()

- L1 sets up the L2 using H_GUEST_SET and the
H_GUEST_VCPU_RUN input buffer

- L1 requests the L0 runs the L2 vCPU using H_GUEST_VCPU_RUN()

- L2 returns to L1 with an exit reason and L1 reads the
H_GUEST_VCPU_RUN output buffer populated by the L0

- L1 handles the exit using H_GET_STATE if necessary

- L1 reruns L2 vCPU with H_GUEST_VCPU_RUN

- L1 frees the L2 in the L0 with H_GUEST_DELETE()

Support for the new API is determined by trying
H_GUEST_GET_CAPABILITIES. On a successful return, use the nestedv2
interface.

Use the vcpu register state setters for tracking modified guest state
elements and copy the thread wide values into the H_GUEST_VCPU_RUN input
buffer immediately before running a L2. The guest wide
elements can not be added to the input buffer so send them with a
separate H_GUEST_SET call if necessary.

Make the vcpu register getter load the corresponding value from the real
host with H_GUEST_GET. To avoid unnecessarily calling H_GUEST_GET, track
which values have already been loaded between H_GUEST_VCPU_RUN calls. If
an element is present in the H_GUEST_VCPU_RUN output buffer it also does
not need to be loaded again.

Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Signed-off-by: Gautam Menghani <gautam@linux.ibm.com>
Signed-off-by: Kautuk Consul <kconsul@linux.vnet.ibm.com>
Signed-off-by: Amit Machhiwal <amachhiw@linux.vnet.ibm.com>
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230914030600.16993-11-jniethe5@gmail.com


# 2a64bc67 13-Sep-2023 Jordan Niethe <jniethe5@gmail.com>

KVM: PPC: Rename accessor generator macros

More "wrapper" style accessor generating macros will be introduced for
the nestedv2 guest support. Rename the existing macros with more
descriptive names now so there is a consistent naming convention.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230914030600.16993-4-jniethe5@gmail.com


# 9fee28ba 02-Aug-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

powerpc: implement the new page table range API

Add set_ptes(), update_mmu_cache_range() and flush_dcache_folio(). Change
the PG_arch_1 (aka PG_dcache_dirty) flag from being per-page to per-folio.

[willy@infradead.org: re-export flush_dcache_icache_folio()]
Link: https://lkml.kernel.org/r/ZMx1daYwvD9EM7Cv@casper.infradead.org
Link: https://lkml.kernel.org/r/20230802151406.3735276-22-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 86dacd96 09-May-2023 Rohan McLure <rmclure@linux.ibm.com>

powerpc: Mark writes registering ipi to host cpu through kvm and polling

Mark writes to hypervisor ipi state so that KCSAN recognises these
asynchronous issue of kvmppc_{set,clear}_host_ipi to be intended, with
atomic writes. Mark asynchronous polls to this variable in
kvm_ppc_read_one_intr().

Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230510033117.1395895-9-rmclure@linux.ibm.com


# 67c48662e 08-Feb-2023 Thomas Huth <thuth@redhat.com>

KVM: PPC: Standardize on "int" return types in the powerpc KVM code

Most functions that are related to kvm_arch_vm_ioctl() already use
"int" as return type to pass error values back to the caller. Some
outlier functions use "long" instead for no good reason (they do not
really require long values here). Let's standardize on "int" here to
avoid casting the values back and forth between the two types.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20230208140105.655814-2-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 953e3739 07-Mar-2023 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Fetch prefixed instructions from the guest

In order to handle emulation of prefixed instructions in the guest,
this first makes vcpu->arch.last_inst be an unsigned long, i.e. 64
bits on 64-bit platforms. For prefixed instructions, the upper 32
bits are used for the prefix and the lower 32 bits for the suffix, and
both halves are byte-swapped if the guest endianness differs from the
host.

Next, vcpu->arch.emul_inst is now 64 bits wide, to match the HEIR
register on POWER10. Like HEIR, for a prefixed instruction it is
defined to have the prefix is in the top 32 bits and the suffix in the
bottom 32 bits, with both halves in the correct byte order.

kvmppc_get_last_inst is extended on 64-bit machines to put the prefix
and suffix in the right places in the ppc_inst_t being returned.

kvmppc_load_last_inst now returns the instruction in an unsigned long
in the same format as vcpu->arch.last_inst. It makes the decision
about whether to fetch a suffix based on the SRR1_PREFIXED bit in the
MSR image stored in the vcpu struct, which generally comes from SRR1
or HSRR1 on an interrupt. This bit is defined in Power ISA v3.1B to
be set if the interrupt occurred due to a prefixed instruction and
cleared otherwise for all interrupts except for instruction storage
interrupt, which does not come to the hypervisor. It is set to zero
for asynchronous interrupts such as external interrupts. In previous
ISA versions it was always set to 0 for all interrupts except
instruction storage interrupt.

The code in book3s_hv_rmhandlers.S that loads the faulting instruction
on a HDSI is only used on POWER8 and therefore doesn't ever need to
load a suffix.

[npiggin@gmail.com - check that the is-prefixed bit in SRR1 matches the
type of instruction that was fetched.]

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/ZAgsq9h1CCzouQuV@cleo


# acf17878 07-Mar-2023 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Make kvmppc_get_last_inst() produce a ppc_inst_t

This changes kvmppc_get_last_inst() so that the instruction it fetches
is returned in a ppc_inst_t variable rather than a u32. This will
allow us to return a 64-bit prefixed instruction on those 64-bit
machines that implement Power ISA v3.1 or later, such as POWER10.
On 32-bit platforms, ppc_inst_t is 32 bits wide, and is turned back
into a u32 by ppc_inst_val, which is an identity operation on those
platforms.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/ZAgsiPlL9O7KnlZZ@cleo


# 460ba21d 30-Mar-2023 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Permit SRR1 flags in more injected interrupt types

The prefix architecture in ISA v3.1 introduces a prefixed bit in SRR1
for many types of synchronous interrupts which is set when the interrupt
is caused by a prefixed instruction.

This requires KVM to be able to set this bit when injecting interrupts
into a guest. Plumb through the SRR1 "flags" argument to the core_queue
APIs where it's missing for this. For now they are set to 0, which is
no change.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fixup kvmppc_core_queue_alignment() in booke.c]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230330103224.3589928-2-npiggin@gmail.com


# ae19b15d 30-Nov-2022 Sean Christopherson <seanjc@google.com>

KVM: PPC: Move processor compatibility check to module init

Move KVM PPC's compatibility checks to their respective module_init()
hooks, there's no need to wait until KVM's common compat check, nor is
there a need to perform the check on every CPU (provided by common KVM's
hook), as the compatibility checks operate on global data.

arch/powerpc/include/asm/cputable.h: extern struct cpu_spec *cur_cpu_spec;
arch/powerpc/kvm/book3s.c: return 0
arch/powerpc/kvm/e500.c: strcmp(cur_cpu_spec->cpu_name, "e500v2")
arch/powerpc/kvm/e500mc.c: strcmp(cur_cpu_spec->cpu_name, "e500mc")
strcmp(cur_cpu_spec->cpu_name, "e5500")
strcmp(cur_cpu_spec->cpu_name, "e6500")

Cc: Fabiano Rosas <farosas@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Message-Id: <20221130230934.1014142-27-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 6c645b01 27-Nov-2022 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3E: Fix CONFIG_TRACE_IRQFLAGS support

32-bit does not trace_irqs_off() to match the trace_irqs_on() call in
kvmppc_fix_ee_before_entry(). This can lead to irqs being enabled twice
in the trace, and the irqs-off region between guest exit and the host
enabling local irqs again is not properly traced.

64-bit code does call this, but from asm code where volatiles are live
and so incorrectly get clobbered.

Move the irq reconcile into C to fix both problems.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221127124942.1665522-2-npiggin@gmail.com


# 3d7a198c 13-Sep-2022 Gaosheng Cui <cuigaosheng1@huawei.com>

KVM: PPC: remove orphan declarations from kvm_ppc.h

Remove the following orphan declarations from kvm_ppc.h:

1. kvmppc_mmu_priv_switch() has been removed since
commit dd9ebf1f9435 ("KVM: PPC: e500: Add shadow PID
support").
2. kvmppc_core_destroy_mmu() has been removed since
commit ecc0981ff07c ("KVM: ppc: cosmetic changes to mmu
hook names").
3. kvmppc_prepare_vrma() has been removed since
commit aa04b4cc5be6 ("KVM: PPC: Allocate RMAs (Real Mode
Areas) at boot for use by guests").

So remove the declarations for them from header file.

Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220913075029.682327-10-cuigaosheng1@huawei.com


# b22af904 09-May-2022 Alexey Kardashevskiy <aik@ozlabs.ru>

KVM: PPC: Book3s: Remove real mode interrupt controller hcalls handlers

Currently we have 2 sets of interrupt controller hypercalls handlers
for real and virtual modes, this is from POWER8 times when switching
MMU on was considered an expensive operation.

POWER9 however does not have dependent threads and MMU is enabled for
handling hcalls so the XIVE native or XICS-on-XIVE real mode handlers
never execute on real P9 and later CPUs.

This untemplate the handlers and only keeps the real mode handlers for
XICS native (up to POWER8) and remove the rest of dead code. Changes
in functions are mechanical except few missing empty lines to make
checkpatch.pl happy.

The default implemented hcalls list already contains XICS hcalls so
no change there.

This should not cause any behavioral change.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220509071150.181250-1-aik@ozlabs.ru


# cad32d9d 05-May-2022 Alexey Kardashevskiy <aik@ozlabs.ru>

KVM: PPC: Book3s: Retire H_PUT_TCE/etc real mode handlers

LoPAPR defines guest visible IOMMU with hypercalls to use it -
H_PUT_TCE/etc. Implemented first on POWER7 where hypercalls would trap
in the KVM in the real mode (with MMU off). The problem with the real mode
is some memory is not available and some API usage crashed the host but
enabling MMU was an expensive operation.

The problems with the real mode handlers are:
1. Occasionally these cannot complete the request so the code is
copied+modified to work in the virtual mode, very little is shared;
2. The real mode handlers have to be linked into vmlinux to work;
3. An exception in real mode immediately reboots the machine.

If the small DMA window is used, the real mode handlers bring better
performance. However since POWER8, there has always been a bigger DMA
window which VMs use to map the entire VM memory to avoid calling
H_PUT_TCE. Such 1:1 mapping happens once and uses H_PUT_TCE_INDIRECT
(a bulk version of H_PUT_TCE) which virtual mode handler is even closer
to its real mode version.

On POWER9 hypercalls trap straight to the virtual mode so the real mode
handlers never execute on POWER9 and later CPUs.

So with the current use of the DMA windows and MMU improvements in
POWER9 and later, there is no point in duplicating the code.
The 32bit passed through devices may slow down but we do not have many
of these in practice. For example, with this applied, a 1Gbit ethernet
adapter still demostrates above 800Mbit/s of actual throughput.

This removes the real mode handlers from KVM and related code from
the powernv platform.

This updates the list of implemented hcalls in KVM-HV as the realmode
handlers are removed.

This changes ABI - kvmppc_h_get_tce() moves to the KVM module and
kvmppc_find_table() is static now.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220506053755.3820702-1-aik@ozlabs.ru


# ad5ace91 02-Mar-2022 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV P9: Move cede logic out of XIVE escalation rearming

Move the cede abort logic out of xive escalation rearming and into
the caller to prepare for handling a similar case with nested guest
entry.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220303053315.1056880-4-npiggin@gmail.com


# 18827eee 23-Jan-2022 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Remove kvmppc_claim_lpid

Removing kvmppc_claim_lpid makes the lpid allocator API a bit simpler to
change the underlying implementation in a future patch.

The host LPID is always 0, so that can be a detail of the allocator. If
the allocator range is restricted, that can reserve LPIDs at the top of
the range. This allows kvmppc_claim_lpid to be removed.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220123120043.3586018-2-npiggin@gmail.com


# 7f921a2d 03-Apr-2022 Michael Ellerman <mpe@ellerman.id.au>

KVM: PPC: Move kvmhv_on_pseries() into kvm_ppc.h

We recently introduced a usage of kvmhv_on_pseries() in powerpc.c, which
causes a build error for ppc64_book3e_allmodconfig:

arch/powerpc/kvm/powerpc.c:716:8: error: implicit declaration of function ‘kvmhv_on_pseries’
716 | if (kvmhv_on_pseries()) {
| ^~~~~~~~~~~~~~~~

Fix it by moving kvmhv_on_pseries() into kvm_ppc.h so that the stub
version is available for book3e builds.

Fixes: f771b55731fc ("KVM: PPC: Use KVM_CAP_PPC_AIL_MODE_3")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# faf01aef 10-Jan-2022 Alexey Kardashevskiy <aik@ozlabs.ru>

KVM: PPC: Merge powerpc's debugfs entry content into generic entry

At the moment KVM on PPC creates 4 types of entries under the kvm debugfs:
1) "%pid-%fd" per a KVM instance (for all platforms);
2) "vm%pid" (for PPC Book3s HV KVM);
3) "vm%u_vcpu%u_timing" (for PPC Book3e KVM);
4) "kvm-xive-%p" (for XIVE PPC Book3s KVM, the same for XICS);

The problem with this is that multiple VMs per process is not allowed for
2) and 3) which makes it possible for userspace to trigger errors when
creating duplicated debugfs entries.

This merges all these into 1).

This defines kvm_arch_create_kvm_debugfs() similar to
kvm_arch_create_vcpu_debugfs().

This defines 2 hooks in kvmppc_ops that allow specific KVM implementations
add necessary entries, this adds the _e500 suffix to
kvmppc_create_vcpu_debugfs_e500() to make it clear what platform it is for.

This makes use of already existing kvm_arch_create_vcpu_debugfs() on PPC.

This removes no more used debugfs_dir pointers from PPC kvm_arch structs.

This stops removing vcpu entries as once created vcpus stay around
for the entire life of a VM and removed when the KVM instance is closed,
see commit d56f5136b010 ("KVM: let kvm_destroy_vm_debugfs clean up vCPU
debugfs directories").

Suggested-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220111005404.162219-1-aik@ozlabs.ru


# eaaaed13 06-Dec-2021 Sean Christopherson <seanjc@google.com>

KVM: PPC: Avoid referencing userspace memory region in memslot updates

For PPC HV, get the number of pages directly from the new memslot instead
of computing the same from the userspace memory region, and explicitly
check for !DELETE instead of inferring the same when toggling mmio_update.
The motivation for these changes is to avoid referencing the @mem param
so that it can be dropped in a future commit.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <1e97fb5198be25f98ef82e63a8d770c682264cc9.1638817639.git.maciej.szmigiero@oracle.com>


# 537a17b3 06-Dec-2021 Sean Christopherson <seanjc@google.com>

KVM: Let/force architectures to deal with arch specific memslot data

Pass the "old" slot to kvm_arch_prepare_memory_region() and force arch
code to handle propagating arch specific data from "new" to "old" when
necessary. This is a baby step towards dynamically allocating "new" from
the get go, and is a (very) minor performance boost on x86 due to not
unnecessarily copying arch data.

For PPC HV, copy the rmap in the !CREATE and !DELETE paths, i.e. for MOVE
and FLAGS_ONLY. This is functionally a nop as the previous behavior
would overwrite the pointer for CREATE, and eventually discard/ignore it
for DELETE.

For x86, copy the arch data only for FLAGS_ONLY changes. Unlike PPC HV,
x86 needs to reallocate arch data in the MOVE case as the size of x86's
allocations depend on the alignment of the memslot's gfn.

Opportunistically tweak kvm_arch_prepare_memory_region()'s param order to
match the "commit" prototype.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
[mss: add missing RISCV kvm_arch_prepare_memory_region() change]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <67dea5f11bbcfd71e3da5986f11e87f5dd4013f9.1638817639.git.maciej.szmigiero@oracle.com>


# 9c5a432a 23-Nov-2021 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV P9: Remove subcore HMI handling

On POWER9 and newer, rather than the complex HMI synchronisation and
subcore state, have each thread un-apply the guest TB offset before
calling into the early HMI handler.

This allows the subcore state to be avoided, including subcore enter
/ exit guest, which includes an expensive divide that shows up
slightly in profiles.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211123095231.1036501-54-npiggin@gmail.com


# 0ba0e5d5 23-Nov-2021 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV: Split P8 from P9 path guest vCPU TLB flushing

This creates separate functions for old and new paths for vCPU TLB
flushing, which will reduce complexity of the next change.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211123095231.1036501-43-npiggin@gmail.com


# e5e78b15 01-Jul-2021 Cédric Le Goater <clg@kaod.org>

KVM: PPC: Book3S HV: XIVE: Change interface of passthrough interrupt routines

The routine kvmppc_set_passthru_irq() calls kvmppc_xive_set_mapped()
and kvmppc_xive_clr_mapped() with an IRQ descriptor. Use directly the
host IRQ number to remove a useless conversion.

Add some debug.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210701132750.1475580-15-clg@kaod.org


# dcbac73a 28-May-2021 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV: Remove virt mode checks from real mode handlers

Now that the P7/8 path no longer supports radix, real-mode handlers
do not need to deal with being called in virt mode.

This change effectively reverts commit acde25726bc6 ("KVM: PPC: Book3S
HV: Add radix checks in real-mode hypercall handlers").

It removes a few more real-mode tests in rm hcall handlers, which
allows the indirect ops for the xive module to be removed from the
built-in xics rm handlers.

kvmppc_h_random is renamed to kvmppc_rm_h_random to be a bit more
descriptive and consistent with other rm handlers.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210528090752.3542186-25-npiggin@gmail.com


# 9dc2babc 28-May-2021 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV P9: Stop handling hcalls in real-mode in the P9 path

In the interest of minimising the amount of code that is run in
"real-mode", don't handle hcalls in real mode in the P9 path. This
requires some new handlers for H_CEDE and xics-on-xive to be added
before xive is pulled or cede logic is checked.

This introduces a change in radix guest behaviour where radix guests
that execute 'sc 1' in userspace now get a privilege fault whereas
previously the 'sc 1' would be reflected as a syscall interrupt to the
guest kernel. That reflection is only required for hash guests that run
PR KVM.

Background:

In POWER8 and earlier processors, it is very expensive to exit from the
HV real mode context of a guest hypervisor interrupt, and switch to host
virtual mode. On those processors, guest->HV interrupts reach the
hypervisor with the MMU off because the MMU is loaded with guest context
(LPCR, SDR1, SLB), and the other threads in the sub-core need to be
pulled out of the guest too. Then the primary must save off guest state,
invalidate SLB and ERAT, and load up host state before the MMU can be
enabled to run in host virtual mode (~= regular Linux mode).

Hash guests also require a lot of hcalls to run due to the nature of the
MMU architecture and paravirtualisation design. The XICS interrupt
controller requires hcalls to run.

So KVM traditionally tries hard to avoid the full exit, by handling
hcalls and other interrupts in real mode as much as possible.

By contrast, POWER9 has independent MMU context per-thread, and in radix
mode the hypervisor is in host virtual memory mode when the HV interrupt
is taken. Radix guests do not require significant hcalls to manage their
translations, and xive guests don't need hcalls to handle interrupts. So
it's much less important for performance to handle hcalls in real mode on
POWER9.

One caveat is that the TCE hcalls are performance critical, real-mode
variants introduced for POWER8 in order to achieve 10GbE performance.
Real mode TCE hcalls were found to be less important on POWER9, which
was able to drive 40GBe networking without them (using the virt mode
hcalls) but performance is still important. These hcalls will benefit
from subsequent guest entry/exit optimisation including possibly a
faster "partial exit" that does not entirely switch to host context to
handle the hcall.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210528090752.3542186-14-npiggin@gmail.com


# 023c3c96 28-May-2021 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV P9: implement kvmppc_xive_pull_vcpu in C

This is more symmetric with kvmppc_xive_push_vcpu, and has the advantage
that it runs with the MMU on.

The extra test added to the asm will go away with a future change.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210528090752.3542186-9-npiggin@gmail.com


# b1c5356e 01-Apr-2021 Sean Christopherson <seanjc@google.com>

KVM: PPC: Convert to the gfn-based MMU notifier callbacks

Move PPC to the gfn-base MMU notifier APIs, and update all 15 bajillion
PPC-internal hooks to work with gfns instead of hvas.

No meaningful functional change intended, though the exact order of
operations is slightly different since the memslot lookups occur before
calling into arch code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 6c12c437 11-Apr-2021 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV: remove unused kvmppc_h_protect argument

The va argument is not used in the function or set by its asm caller,
so remove it to be safe.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210412014845.1517916-8-npiggin@gmail.com


# ec94b9b2 02-Feb-2021 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc/mm: Add PG_dcache_clean to indicate dcache clean state

This just add a better name for PG_arch_1. No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210203045812.234439-2-aneesh.kumar@linux.ibm.com


# 6c6fdbb2 25-Jan-2021 Chengyang Fan <cy.fan@huawei.com>

powerpc: remove unneeded semicolons

Remove superfluous semicolons after function definitions.

Signed-off-by: Chengyang Fan <cy.fan@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210125095338.1719405-1-cy.fan@huawei.com


# a722076e 05-Feb-2021 Fabiano Rosas <farosas@linux.ibm.com>

KVM: PPC: Don't always report hash MMU capability for P9 < DD2.2

These machines don't support running both MMU types at the same time,
so remove the KVM_CAP_PPC_MMU_HASH_V3 capability when the host is
using Radix MMU.

[paulus@ozlabs.org - added defensive check on
kvmppc_hv_ops->hash_v3_possible]

Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# d9a47eda 16-Dec-2020 Ravi Bangoria <ravi.bangoria@linux.ibm.com>

KVM: PPC: Book3S HV: Introduce new capability for 2nd DAWR

Introduce KVM_CAP_PPC_DAWR1 which can be used by QEMU to query whether
KVM supports 2nd DAWR or not. The capability is by default disabled
even when the underlying CPU supports 2nd DAWR. QEMU needs to check
and enable it manually to use the feature.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 7ec21d9d 23-Jun-2020 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>

KVM: PPC: Clean up redundant kvm_run parameters in assembly

In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function parameters
retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
patch does a unified cleanup of these remaining redundant parameters.

[paulus@ozlabs.org - Fixed places that were missed in book3s_interrupts.S]

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 8c99d345 26-Apr-2020 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>

KVM: PPC: Clean up redundant 'kvm_run' parameters

In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function parameters
retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
patch does a unified cleanup of these remaining redundant parameters.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 9a5788c6 18-Mar-2020 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Add a capability for enabling secure guests

At present, on Power systems with Protected Execution Facility
hardware and an ultravisor, a KVM guest can transition to being a
secure guest at will. Userspace (QEMU) has no way of knowing
whether a host system is capable of running secure guests. This
will present a problem in future when the ultravisor is capable of
migrating secure guests from one host to another, because
virtualization management software will have no way to ensure that
secure guests only run in domains where all of the hosts can
support secure guests.

This adds a VM capability which has two functions: (a) userspace
can query it to find out whether the host can support secure guests,
and (b) userspace can enable it for a guest, which allows that
guest to become a secure guest. If userspace does not enable it,
KVM will return an error when the ultravisor does the hypercall
that indicates that the guest is starting to transition to a
secure guest. The ultravisor will then abort the transition and
the guest will terminate.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>


# 6fef0c6b 18-Mar-2020 Greg Kurz <groug@kaod.org>

KVM: PPC: Kill kvmppc_ops::mmu_destroy() and kvmppc_mmu_destroy()

These are only used by HV KVM and BookE, and in both cases they are
nops.

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 3f1268dd 18-Mar-2020 Greg Kurz <groug@kaod.org>

KVM: PPC: Book3S PR: Move kvmppc_mmu_init() into PR KVM

This is only relevant to PR KVM. Make it obvious by moving the
function declaration to the Book3s header and rename it with
a _pr suffix.

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# e96c81ee 18-Feb-2020 Sean Christopherson <seanjc@google.com>

KVM: Simplify kvm_free_memslot() and all its descendents

Now that all callers of kvm_free_memslot() pass NULL for @dont, remove
the param from the top-level routine and all arch's implementations.

No functional change intended.

Tested-by: Christoffer Dall <christoffer.dall@arm.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 82307e67 18-Feb-2020 Sean Christopherson <seanjc@google.com>

KVM: PPC: Move memslot memory allocation into prepare_memory_region()

Allocate the rmap array during kvm_arch_prepare_memory_region() to pave
the way for removing kvm_arch_create_memslot() altogether. Moving PPC's
memory allocation only changes the order of kernel memory allocations
between PPC and common KVM code.

No functional change intended.

Acked-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# ff030fdf 18-Dec-2019 Sean Christopherson <seanjc@google.com>

KVM: PPC: Move kvm_vcpu_init() invocation to common code

Move the kvm_cpu_{un}init() calls to common PPC code as an intermediate
step towards removing kvm_cpu_{un}init() altogether.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# c50bfbdc 18-Dec-2019 Sean Christopherson <seanjc@google.com>

KVM: PPC: Allocate vcpu struct in common PPC code

Move allocation of all flavors of PPC vCPUs to common PPC code. All
variants either allocate 'struct kvm_vcpu' directly, or require that
the embedded 'struct kvm_vcpu' member be located at offset 0, i.e.
guarantee that the allocation can be directly interpreted as a 'struct
kvm_vcpu' object.

Remove the message from the build-time assertion regarding placement of
the struct, as compatibility with the arch usercopy region is no longer
the sole dependent on 'struct kvm_vcpu' being at offset zero.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 22945688 24-Nov-2019 Bharata B Rao <bharata@linux.ibm.com>

KVM: PPC: Book3S HV: Support reset of secure guest

Add support for reset of secure guest via a new ioctl KVM_PPC_SVM_OFF.
This ioctl will be issued by QEMU during reset and includes the
the following steps:

- Release all device pages of the secure guest.
- Ask UV to terminate the guest via UV_SVM_TERMINATE ucall
- Unpin the VPA pages so that they can be migrated back to secure
side when guest becomes secure again. This is required because
pinned pages can't be migrated.
- Reinit the partition scoped page tables

After these steps, guest is ready to issue UV_ESM call once again
to switch to secure mode.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
[Implementation of uv_svm_terminate() and its call from
guest shutdown path]
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[Unpinning of VPA pages]
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 87a45e07 02-Oct-2019 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S: Replace reset_msr mmu op with inject_interrupt arch op

reset_msr sets the MSR for interrupt injection, but it's cleaner and
more flexible to provide a single op to set both MSR and PC for the
interrupt.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 3a83f677 11-Sep-2019 Michael Roth <mdroth@linux.vnet.ibm.com>

KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag

On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB
of memory running the following guest configs:

guest A:
- 224GB of memory
- 56 VCPUs (sockets=1,cores=28,threads=2), where:
VCPUs 0-1 are pinned to CPUs 0-3,
VCPUs 2-3 are pinned to CPUs 4-7,
...
VCPUs 54-55 are pinned to CPUs 108-111

guest B:
- 4GB of memory
- 4 VCPUs (sockets=1,cores=4,threads=1)

with the following workloads (with KSM and THP enabled in all):

guest A:
stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M

guest B:
stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M

host:
stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M

the below soft-lockup traces were observed after an hour or so and
persisted until the host was reset (this was found to be reliably
reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
and 5.3-rc5):

[ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1253.183319] rcu: 124-....: (5250 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=1941
[ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709]
[ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913]
[ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331]
[ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338]
[ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525]
[ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
[ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1316.198032] rcu: 124-....: (21003 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=8243
[ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
[ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1379.212629] rcu: 124-....: (36756 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=14714
[ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
[ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1442.227115] rcu: 124-....: (52509 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=21403
[ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
[ 1455.111822] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
[ 1455.111905] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
[ 1455.111986] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
[ 1455.112068] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
[ 1455.112159] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
[ 1455.112231] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
[ 1455.112303] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1
[ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
[ 1455.112392] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1

CPUs 45, 24, and 124 are stuck on spin locks, likely held by
CPUs 105 and 31.

CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on
target CPU 42. For instance:

# CPU 105 registers (via xmon)
R00 = c00000000020b20c R16 = 00007d1bcd800000
R01 = c00000363eaa7970 R17 = 0000000000000001
R02 = c0000000019b3a00 R18 = 000000000000006b
R03 = 000000000000002a R19 = 00007d537d7aecf0
R04 = 000000000000002a R20 = 60000000000000e0
R05 = 000000000000002a R21 = 0801000000000080
R06 = c0002073fb0caa08 R22 = 0000000000000d60
R07 = c0000000019ddd78 R23 = 0000000000000001
R08 = 000000000000002a R24 = c00000000147a700
R09 = 0000000000000001 R25 = c0002073fb0ca908
R10 = c000008ffeb4e660 R26 = 0000000000000000
R11 = c0002073fb0ca900 R27 = c0000000019e2464
R12 = c000000000050790 R28 = c0000000000812b0
R13 = c000207fff623e00 R29 = c0002073fb0ca808
R14 = 00007d1bbee00000 R30 = c0002073fb0ca800
R15 = 00007d1bcd600000 R31 = 0000000000000800
pc = c00000000020b260 smp_call_function_many+0x3d0/0x460
cfar= c00000000020b270 smp_call_function_many+0x3e0/0x460
lr = c00000000020b20c smp_call_function_many+0x37c/0x460
msr = 900000010288b033 cr = 44024824
ctr = c000000000050790 xer = 0000000000000000 trap = 100

CPU 42 is running normally, doing VCPU work:

# CPU 42 stack trace (via xmon)
[link register ] c00800001be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 [kvm_hv]
[c000008ed3343820] c000008ed3343850 (unreliable)
[c000008ed33438d0] c00800001be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 [kvm_hv]
[c000008ed33439d0] c00800001be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv]
[c000008ed3343ae0] c00800001c10891c kvmppc_vcpu_run+0x34/0x48 [kvm]
[c000008ed3343b00] c00800001c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
[c000008ed3343b90] c00800001c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm]
[c000008ed3343d00] c000000000475450 do_vfs_ioctl+0xe0/0xc70
[c000008ed3343db0] c0000000004760e4 ksys_ioctl+0x104/0x120
[c000008ed3343e00] c000000000476128 sys_ioctl+0x28/0x80
[c000008ed3343e20] c00000000000b388 system_call+0x5c/0x70
--- Exception: c00 (System Call) at 00007d545cfd7694
SP (7d53ff7edf50) is in userspace

It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION]
was set for CPU 42 by at least 1 of the CPUs waiting in
smp_call_function_many(), but somehow the corresponding
call_single_queue entries were never processed by CPU 42, causing the
callers to spin in csd_lock_wait() indefinitely.

Nick Piggin suggested something similar to the following sequence as
a possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE
IPI messages seems to be most common, but any mix of CALL_FUNCTION and
!CALL_FUNCTION messages could trigger it):

CPU
X: smp_muxed_ipi_set_message():
X: smp_mb()
X: message[RESCHEDULE] = 1
X: doorbell_global_ipi(42):
X: kvmppc_set_host_ipi(42, 1)
X: ppc_msgsnd_sync()/smp_mb()
X: ppc_msgsnd() -> 42
42: doorbell_exception(): // from CPU X
42: ppc_msgsync()
105: smp_muxed_ipi_set_message():
105: smb_mb()
// STORE DEFERRED DUE TO RE-ORDERING
--105: message[CALL_FUNCTION] = 1
| 105: doorbell_global_ipi(42):
| 105: kvmppc_set_host_ipi(42, 1)
| 42: kvmppc_set_host_ipi(42, 0)
| 42: smp_ipi_demux_relaxed()
| 42: // returns to executing guest
| // RE-ORDERED STORE COMPLETES
->105: message[CALL_FUNCTION] = 1
105: ppc_msgsnd_sync()/smp_mb()
105: ppc_msgsnd() -> 42
42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
105: // hangs waiting on 42 to process messages/call_single_queue

This can be prevented with an smp_mb() at the beginning of
kvmppc_set_host_ipi(), such that stores to message[<type>] (or other
state indicated by the host_ipi flag) are ordered vs. the store to
to host_ipi.

However, doing so might still allow for the following scenario (not
yet observed):

CPU
X: smp_muxed_ipi_set_message():
X: smp_mb()
X: message[RESCHEDULE] = 1
X: doorbell_global_ipi(42):
X: kvmppc_set_host_ipi(42, 1)
X: ppc_msgsnd_sync()/smp_mb()
X: ppc_msgsnd() -> 42
42: doorbell_exception(): // from CPU X
42: ppc_msgsync()
// STORE DEFERRED DUE TO RE-ORDERING
-- 42: kvmppc_set_host_ipi(42, 0)
| 42: smp_ipi_demux_relaxed()
| 105: smp_muxed_ipi_set_message():
| 105: smb_mb()
| 105: message[CALL_FUNCTION] = 1
| 105: doorbell_global_ipi(42):
| 105: kvmppc_set_host_ipi(42, 1)
| // RE-ORDERED STORE COMPLETES
-> 42: kvmppc_set_host_ipi(42, 0)
42: // returns to executing guest
105: ppc_msgsnd_sync()/smp_mb()
105: ppc_msgsnd() -> 42
42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
105: // hangs waiting on 42 to process messages/call_single_queue

Fixing this scenario would require an smp_mb() *after* clearing
host_ipi flag in kvmppc_set_host_ipi() to order the store vs.
subsequent processing of IPI messages.

To handle both cases, this patch splits kvmppc_set_host_ipi() into
separate set/clear functions, where we execute smp_mb() prior to
setting host_ipi flag, and after clearing host_ipi flag. These
functions pair with each other to synchronize the sender and receiver
sides.

With that change in place the above workload ran for 20 hours without
triggering any lock-ups.

Fixes: 755563bc79c7 ("powerpc/powernv: Fixes for hypervisor doorbell handling") # v4.0
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190911223155.16045-1-mdroth@linux.vnet.ibm.com


# 2ad7a27d 26-Aug-2019 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S: Enable XIVE native capability only if OPAL has required functions

There are some POWER9 machines where the OPAL firmware does not support
the OPAL_XIVE_GET_QUEUE_STATE and OPAL_XIVE_SET_QUEUE_STATE calls.
The impact of this is that a guest using XIVE natively will not be able
to be migrated successfully. On the source side, the get_attr operation
on the KVM native device for the KVM_DEV_XIVE_GRP_EQ_CONFIG attribute
will fail; on the destination side, the set_attr operation for the same
attribute will fail.

This adds tests for the existence of the OPAL get/set queue state
functions, and if they are not supported, the XIVE-native KVM device
is not created and the KVM_CAP_PPC_IRQ_XIVE capability returns false.
Userspace can then either provide a software emulation of XIVE, or
else tell the guest that it does not have a XIVE controller available
to it.

Cc: stable@vger.kernel.org # v5.2+
Fixes: 3fab2d10588e ("KVM: PPC: Book3S HV: XIVE: Activate XIVE exploitation mode")
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# d94d71cb 29-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 266

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation this program is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details you should have received a copy of the gnu general
public license along with this program if not write to the free
software foundation 51 franklin street fifth floor boston ma 02110
1301 usa

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 67 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141333.953658117@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# e4945b9d 17-Apr-2019 Cédric Le Goater <clg@kaod.org>

KVM: PPC: Book3S HV: XIVE: Add get/set accessors for the VP XIVE state

The state of the thread interrupt management registers needs to be
collected for migration. These registers are cached under the
'xive_saved_state.w01' field of the VCPU when the VPCU context is
pulled from the HW thread. An OPAL call retrieves the backup of the
IPB register in the underlying XIVE NVT structure and merges it in the
KVM state.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# eacc56bb 17-Apr-2019 Cédric Le Goater <clg@kaod.org>

KVM: PPC: Book3S HV: XIVE: Introduce a new capability KVM_CAP_PPC_IRQ_XIVE

The user interface exposes a new capability KVM_CAP_PPC_IRQ_XIVE to
let QEMU connect the vCPU presenters to the XIVE KVM device if
required. The capability is not advertised for now as the full support
for the XIVE native exploitation mode is not yet available. When this
is case, the capability will be advertised on PowerNV Hypervisors
only. Nested guests (pseries KVM Hypervisor) are not supported.

Internally, the interface to the new KVM device is protected with a
new interrupt mode: KVMPPC_IRQ_XIVE.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 90c73795 17-Apr-2019 Cédric Le Goater <clg@kaod.org>

KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode

This is the basic framework for the new KVM device supporting the XIVE
native exploitation mode. The user interface exposes a new KVM device
to be created by QEMU, only available when running on a L0 hypervisor.
Support for nested guests is not available yet.

The XIVE device reuses the device structure of the XICS-on-XIVE device
as they have a lot in common. That could possibly change in the future
if the need arise.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 70ea13f6 29-Apr-2019 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Flush TLB on secondary radix threads

When running on POWER9 with kvm_hv.indep_threads_mode = N and the host
in SMT1 mode, KVM will run guest VCPUs on offline secondary threads.
If those guests are in radix mode, we fail to load the LPID and flush
the TLB if necessary, leading to the guest crashing with an
unsupported MMU fault. This arises from commit 9a4506e11b97 ("KVM:
PPC: Book3S HV: Make radix handle process scoped LPID flush in C,
with relocation on", 2018-05-17), which didn't consider the case
where indep_threads_mode = N.

For simplicity, this makes the real-mode guest entry path flush the
TLB in the same place for both radix and hash guests, as we did before
9a4506e11b97, though the code is now C code rather than assembly code.
We also have the radix TLB flush open-coded rather than calling
radix__local_flush_tlb_lpid_guest(), because the TLB flush can be
called in real mode, and in real mode we don't want to invoke the
tracepoint code.

Fixes: 9a4506e11b97 ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 2940ba0c 29-Apr-2019 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Move HPT guest TLB flushing to C code

This replaces assembler code in book3s_hv_rmhandlers.S that checks
the kvm->arch.need_tlb_flush cpumask and optionally does a TLB flush
with C code in book3s_hv_builtin.c. Note that unlike the radix
version, the hash version doesn't do an explicit ERAT invalidation
because we will invalidate and load up the SLB before entering the
guest, and that will invalidate the ERAT.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# e1a1ef84 28-Mar-2019 Alexey Kardashevskiy <aik@ozlabs.ru>

KVM: PPC: Book3S: Allocate guest TCEs on demand too

We already allocate hardware TCE tables in multiple levels and skip
intermediate levels when we can, now it is a turn of the KVM TCE tables.
Thankfully these are allocated already in 2 levels.

This moves the table's last level allocation from the creating helper to
kvmppc_tce_put() and kvm_spapr_tce_fault(). Since such allocation cannot
be done in real mode, this creates a virtual mode version of
kvmppc_tce_put() which handles allocations.

This adds kvmppc_rm_ioba_validate() to do an additional test if
the consequent kvmppc_tce_put() needs a page which has not been allocated;
if this is the case, we bail out to virtual mode handlers.

The allocations are protected by a new mutex as kvm->lock is not suitable
for the task because the fault handler is called with the mmap_sem held
but kvmhv_setup_mmu() locks kvm->lock and mmap_sem in the reverse order.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 2001825e 28-Mar-2019 Alexey Kardashevskiy <aik@ozlabs.ru>

KVM: PPC: Book3S HV: Avoid lockdep debugging in TCE realmode handlers

The kvmppc_tce_to_ua() helper is called from real and virtual modes
and it works fine as long as CONFIG_DEBUG_LOCKDEP is not enabled.
However if the lockdep debugging is on, the lockdep will most likely break
in kvm_memslots() because of srcu_dereference_check() so we need to use
PPC-own kvm_memslots_raw() which uses realmode safe
rcu_dereference_raw_notrace().

This creates a realmode copy of kvmppc_tce_to_ua() which replaces
kvm_memslots() with kvm_memslots_raw().

Since kvmppc_rm_tce_to_ua() becomes static and can only be used inside
HV KVM, this moves it earlier under CONFIG_KVM_BOOK3S_HV_POSSIBLE.

This moves truly virtual-mode kvmppc_tce_to_ua() to where it belongs and
drops the prmap parameter which was never used in the virtual mode.

Fixes: d3695aa4f452 ("KVM: PPC: Add support for multiple-TCE hcalls", 2016-02-15)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# eadfb1c5 22-Mar-2019 Suraj Jitindar Singh <sjitindarsingh@gmail.com>

KVM: PPC: Book3S HV: Implement real mode H_PAGE_INIT handler

Implement a real mode handler for the H_CALL H_PAGE_INIT which can be
used to zero or copy a guest page. The page is defined to be 4k and must
be 4k aligned.

The in-kernel real mode handler halves the time to handle this H_CALL
compared to handling it in userspace for a hash guest.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# e74d53e3 24-Feb-2019 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Fix compilation when KVM is not enabled

Compiling with CONFIG_PPC_POWERNV=y and KVM disabled currently gives
an error like this:

CC arch/powerpc/kernel/dbell.o
In file included from arch/powerpc/kernel/dbell.c:20:0:
arch/powerpc/include/asm/kvm_ppc.h: In function ‘xics_on_xive’:
arch/powerpc/include/asm/kvm_ppc.h:625:9: error: implicit declaration of function ‘xive_enabled’ [-Werror=implicit-function-declaration]
return xive_enabled() && cpu_has_feature(CPU_FTR_HVMODE);
^
cc1: all warnings being treated as errors
scripts/Makefile.build:276: recipe for target 'arch/powerpc/kernel/dbell.o' failed
make[3]: *** [arch/powerpc/kernel/dbell.o] Error 1

Fix this by making the xics_on_xive() definition conditional on the
same symbol (CONFIG_KVM_BOOK3S_64_HANDLER) that determines whether we
include <asm/xive.h> or not, since that's the header that defines
xive_enabled().

Fixes: 03f953329bd8 ("KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVE")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 884dfb72 20-Feb-2019 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Simplify machine check handling

This makes the handling of machine check interrupts that occur inside
a guest simpler and more robust, with less done in assembler code and
in real mode.

Now, when a machine check occurs inside a guest, we always get the
machine check event struct and put a copy in the vcpu struct for the
vcpu where the machine check occurred. We no longer call
machine_check_queue_event() from kvmppc_realmode_mc_power7(), because
on POWER8, when a vcpu is running on an offline secondary thread and
we call machine_check_queue_event(), that calls irq_work_queue(),
which doesn't work because the CPU is offline, but instead triggers
the WARN_ON(lazy_irq_pending()) in pnv_smp_cpu_kill_self() (which
fires again and again because nothing clears the condition).

All that machine_check_queue_event() actually does is to cause the
event to be printed to the console. For a machine check occurring in
the guest, we now print the event in kvmppc_handle_exit_hv()
instead.

The assembly code at label machine_check_realmode now just calls C
code and then continues exiting the guest. We no longer either
synthesize a machine check for the guest in assembly code or return
to the guest without a machine check.

The code in kvmppc_handle_exit_hv() is extended to handle the case
where the guest is not FWNMI-capable. In that case we now always
synthesize a machine check interrupt for the guest. Previously, if
the host thinks it has recovered the machine check fully, it would
return to the guest without any notification that the machine check
had occurred. If the machine check was caused by some action of the
guest (such as creating duplicate SLB entries), it is much better to
tell the guest that it has caused a problem. Therefore we now always
generate a machine check interrupt for guests that are not
FWNMI-capable.

Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 03f95332 04-Feb-2019 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVE

Currently, the KVM code assumes that if the host kernel is using the
XIVE interrupt controller (the new interrupt controller that first
appeared in POWER9 systems), then the in-kernel XICS emulation will
use the XIVE hardware to deliver interrupts to the guest. However,
this only works when the host is running in hypervisor mode and has
full access to all of the XIVE functionality. It doesn't work in any
nested virtualization scenario, either with PR KVM or nested-HV KVM,
because the XICS-on-XIVE code calls directly into the native-XIVE
routines, which are not initialized and cannot function correctly
because they use OPAL calls, and OPAL is not available in a guest.

This means that using the in-kernel XICS emulation in a nested
hypervisor that is using XIVE as its interrupt controller will cause a
(nested) host kernel crash. To fix this, we change most of the places
where the current code calls xive_enabled() to select between the
XICS-on-XIVE emulation and the plain XICS emulation to call a new
function, xics_on_xive(), which returns false in a guest.

However, there is a further twist. The plain XICS emulation has some
functions which are used in real mode and access the underlying XICS
controller (the interrupt controller of the host) directly. In the
case of a nested hypervisor, this means doing XICS hypercalls
directly. When the nested host is using XIVE as its interrupt
controller, these hypercalls will fail. Therefore this also adds
checks in the places where the XICS emulation wants to access the
underlying interrupt controller directly, and if that is XIVE, makes
the code use the virtual mode fallback paths, which call generic
kernel infrastructure rather than doing direct XICS access.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# dceadcf9 13-Dec-2018 Suraj Jitindar Singh <sjitindarsingh@gmail.com>

KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops struct

The kvmppc_ops struct is used to store function pointers to kvm
implementation specific functions.

Introduce two new functions load_from_eaddr and store_to_eaddr to be
used to load from and store to a guest effective address respectively.

Also implement these for the kvm-hv module. If we are using the radix
mmu then we can call the functions to access quadrant 1 and 2.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# f032b734 11-Dec-2018 Bharata B Rao <bharata@linux.ibm.com>

KVM: PPC: Pass change type down to memslot commit function

Currently, kvm_arch_commit_memory_region() gets called with a
parameter indicating what type of change is being made to the memslot,
but it doesn't pass it down to the platform-specific memslot commit
functions. This adds the `change' parameter to the lower-level
functions so that they can use it in future.

[paulus@ozlabs.org - fix book E also.]

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# aa069a99 21-Sep-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Add a VM capability to enable nested virtualization

With this, userspace can enable a KVM-HV guest to run nested guests
under it.

The administrator can control whether any nested guests can be run;
setting the "nested" module parameter to false prevents any guests
becoming nested hypervisors (that is, any attempt to enable the nested
capability on a guest will fail). Guests which are already nested
hypervisors will continue to be so.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 95a6432c 07-Oct-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests

This creates an alternative guest entry/exit path which is used for
radix guests on POWER9 systems when we have indep_threads_mode=Y. In
these circumstances there is exactly one vcpu per vcore and there is
no coordination required between vcpus or vcores; the vcpu can enter
the guest without needing to synchronize with anything else.

The new fast path is implemented almost entirely in C in book3s_hv.c
and runs with the MMU on until the guest is entered. On guest exit
we use the existing path until the point where we are committed to
exiting the guest (as distinct from handling an interrupt in the
low-level code and returning to the guest) and we have pulled the
guest context from the XIVE. At that point we check a flag in the
stack frame to see whether we came in via the old path and the new
path; if we came in via the new path then we go back to C code to do
the rest of the process of saving the guest context and restoring the
host context.

The C code is split into separate functions for handling the
OS-accessible state and the hypervisor state, with the idea that the
latter can be replaced by a hypercall when we implement nested
virtualization.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Fix CONFIG_ALTIVEC=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# f7035ce9 07-Oct-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Move interrupt delivery on guest entry to C code

This is based on a patch by Suraj Jitindar Singh.

This moves the code in book3s_hv_rmhandlers.S that generates an
external, decrementer or privileged doorbell interrupt just before
entering the guest to C code in book3s_hv_builtin.c. This is to
make future maintenance and modification easier. The algorithm
expressed in the C code is almost identical to the previous
algorithm.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# a3ac077b 10-Sep-2018 Alexey Kardashevskiy <aik@ozlabs.ru>

KVM: PPC: Remove redundand permission bits removal

The kvmppc_gpa_to_ua() helper itself takes care of the permission
bits in the TCE and yet every single caller removes them.

This changes semantics of kvmppc_gpa_to_ua() so it takes TCEs
(which are GPAs + TCE permission bits) to make the callers simpler.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 42de7b9e 10-Sep-2018 Alexey Kardashevskiy <aik@ozlabs.ru>

KVM: PPC: Validate TCEs against preregistered memory page sizes

The userspace can request an arbitrary supported page size for a DMA
window and this works fine as long as the mapped memory is backed with
the pages of the same or bigger size; if this is not the case,
mm_iommu_ua_to_hpa{_rm}() fail and tables do not populated with
dangerously incorrect TCEs.

However since it is quite easy to misconfigure the KVM and we do not do
reverts to all changes made to TCE tables if an error happens in a middle,
we better do the acceptable page size validation before we even touch
the tables.

This enhances kvmppc_tce_validate() to check the hardware IOMMU page sizes
against the preregistered memory page sizes.

Since the new check uses real/virtual mode helpers, this renames
kvmppc_tce_validate() to kvmppc_rm_tce_validate() to handle the real mode
case and mirrors it for the virtual mode under the old name. The real
mode handler is not used for the virtual mode as:
1. it uses _lockless() list traversing primitives instead of RCU;
2. realmode's mm_iommu_ua_to_hpa_rm() uses vmalloc_to_phys() which
virtual mode does not have to use and since on POWER9+radix only virtual
mode handlers actually work, we do not want to slow down that path even
a bit.

This removes EXPORT_SYMBOL_GPL(kvmppc_tce_validate) as the validators
are static now.

From now on the attempts on mapping IOMMU pages bigger than allowed
will result in KVM exit.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Fix KVM_HV=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# acc9eb93 20-May-2018 Simon Guo <wei.guo.simon@gmail.com>

KVM: PPC: Reimplement LOAD_VMX/STORE_VMX instruction mmio emulation with analyse_instr() input

This patch reimplements LOAD_VMX/STORE_VMX MMIO emulation with
analyse_instr() input. When emulating the store, the VMX reg will need to
be flushed so that the right reg val can be retrieved before writing to
IO MEM.

This patch also adds support for lvebx/lvehx/lvewx/stvebx/stvehx/stvewx
MMIO emulation. To meet the requirement of handling different element
sizes, kvmppc_handle_load128_by2x64()/kvmppc_handle_store128_by2x64()
were replaced with kvmppc_handle_vmx_load()/kvmppc_handle_vmx_store().

The framework used is similar to VSX instruction MMIO emulation.

Suggested-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 2e6baa46 20-May-2018 Simon Guo <wei.guo.simon@gmail.com>

KVM: PPC: Add giveup_ext() hook to PPC KVM ops

Currently HV will save math regs(FP/VEC/VSX) when trap into host. But
PR KVM will only save math regs when qemu task switch out of CPU, or
when returning from qemu code.

To emulate FP/VEC/VSX mmio load, PR KVM need to make sure that math
regs were flushed firstly and then be able to update saved VCPU
FPR/VEC/VSX area reasonably.

This patch adds giveup_ext() field to KVM ops. Only PR KVM has non-NULL
giveup_ext() ops. kvmppc_complete_mmio_load() can invoke that hook
(when not NULL) to flush math regs accordingly, before updating saved
register vals.

Math regs flush is also necessary for STORE, which will be covered
in later patch within this patch series.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 70923603 20-May-2018 Simon Guo <wei.guo.simon@gmail.com>

KVM: PPC: Reimplement non-SIMD LOAD/STORE instruction mmio emulation with analyse_instr() input

This patch reimplements non-SIMD LOAD/STORE instruction MMIO emulation
with analyse_instr() input. It utilizes the BYTEREV/UPDATE/SIGNEXT
properties exported by analyse_instr() and invokes
kvmppc_handle_load(s)/kvmppc_handle_store() accordingly.

It also moves CACHEOP type handling into the skeleton.

instruction_type within kvm_ppc.h is renamed to avoid conflict with
sstep.h.

Suggested-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# d2e60075 13-Feb-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc/64: Use array of paca pointers and allocate pacas individually

Change the paca array into an array of pointers to pacas. Allocate
pacas individually.

This allows flexibility in where the PACAs are allocated. Future work
will allocate them node-local. Platforms that don't have address limits
on PACAs would be able to defer PACA allocations until later in boot
rather than allocate all possible ones up-front then freeing unused.

This is slightly more overhead (one additional indirection) for cross
CPU paca references, but those aren't too common.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 39c983ea 21-Feb-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Remove unused kvm_unmap_hva callback

Since commit fb1522e099f0 ("KVM: update to new mmu_notifier semantic
v2", 2017-08-31), the MMU notifier code in KVM no longer calls the
kvm_unmap_hva callback. This removes the PPC implementations of
kvm_unmap_hva().

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 09f98496 03-Feb-2018 Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>

KVM: PPC: Book3S: Add MMIO emulation for VMX instructions

This patch provides the MMIO load/store vector indexed
X-Form emulation.

Instructions implemented:
lvx: the quadword in storage addressed by the result of EA &
0xffff_ffff_ffff_fff0 is loaded into VRT.

stvx: the contents of VRS are stored into the quadword in storage
addressed by the result of EA & 0xffff_ffff_ffff_fff0.

Reported-by: Gopesh Kumar Chaudhary <gopchaud@in.ibm.com>
Reported-by: Balamuruhan S <bala24@linux.vnet.ibm.com>
Signed-off-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 4e26bc4a 19-Dec-2017 Madhavan Srinivasan <maddy@linux.vnet.ibm.com>

powerpc/64: Rename soft_enabled to irq_soft_mask

Rename the paca->soft_enabled to paca->irq_soft_mask as it is no
longer used as a flag for interrupt state, but a mask.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 0b63acf4 19-Dec-2017 Madhavan Srinivasan <maddy@linux.vnet.ibm.com>

powerpc/64: Move set_soft_enabled() and rename

Move set_soft_enabled() from powerpc/kernel/irq.c to asm/hw_irq.c, to
encourage updates to paca->soft_enabled done via these access
function. Add "memory" clobber to hint compiler since
paca->soft_enabled memory is the target here.

Renaming it as soft_enabled_set() will make namespaces works better as
prefix than a postfix when new soft_enabled manipulation functions are
introduced.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# c2e480ba 19-Dec-2017 Madhavan Srinivasan <maddy@linux.vnet.ibm.com>

powerpc/64: Add #defines for paca->soft_enabled flags

Two #defines IRQS_ENABLED and IRQS_DISABLED are added to be used when
updating paca->soft_enabled. Replace the hardcoded values used when
updating paca->soft_enabled with IRQ_(EN|DIS)ABLED #define. No logic
change.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# ded13fc1 21-Nov-2017 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Fix migration and HPT resizing of HPT guests on radix hosts

This fixes two errors that prevent a guest using the HPT MMU from
successfully migrating to a POWER9 host in radix MMU mode, or resizing
its HPT when running on a radix host.

The first bug was that commit 8dc6cca556e4 ("KVM: PPC: Book3S HV:
Don't rely on host's page size information", 2017-09-11) missed two
uses of hpte_base_page_size(), one in the HPT rehashing code and
one in kvm_htab_write() (which is used on the destination side in
migrating a HPT guest). Instead we use kvmppc_hpte_base_page_shift().
Having the shift count means that we can use left and right shifts
instead of multiplication and division in a few places.

Along the way, this adds a check in kvm_htab_write() to ensure that the
page size encoding in the incoming HPTEs is recognized, and if not
return an EINVAL error to userspace.

The second bug was that kvm_htab_write was performing some but not all
of the functions of kvmhv_setup_mmu(), resulting in the destination VM
being left in radix mode as far as the hardware is concerned. The
simplest fix for now is make kvm_htab_write() call
kvmppc_setup_partition_table() like kvmppc_hv_setup_htab_rma() does.
In future it would be better to refactor the code more extensively
to remove the duplication.

Fixes: 8dc6cca556e4 ("KVM: PPC: Book3S HV: Don't rely on host's page size information")
Fixes: 7a84084c6054 ("KVM: PPC: Book3S HV: Set partition table rather than SDR1 on POWER9")
Reported-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Tested-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 18c3640c 13-Sep-2017 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host

This sets up the machinery for switching a guest between HPT (hashed
page table) and radix MMU modes, so that in future we can run a HPT
guest on a radix host on POWER9 machines.

* The KVM_PPC_CONFIGURE_V3_MMU ioctl can now specify either HPT or
radix mode, on a radix host.

* The KVM_CAP_PPC_MMU_HASH_V3 capability now returns 1 on POWER9
with HV KVM on a radix host.

* The KVM_PPC_GET_SMMU_INFO returns information about the HPT MMU on a
radix host.

* The KVM_PPC_ALLOCATE_HTAB ioctl on a radix host will switch the
guest to HPT mode and allocate a HPT.

* For simplicity, we now allocate the rmap array for each memslot,
even on a radix host, since it will be needed if the guest switches
to HPT mode.

* Since we cannot yet run a HPT guest on a radix host, the KVM_RUN
ioctl will return an EINVAL error in that case.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 3c313524 05-Feb-2017 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Allow userspace to set the desired SMT mode

This allows userspace to set the desired virtual SMT (simultaneous
multithreading) mode for a VM, that is, the number of VCPUs that
get assigned to each virtual core. Previously, the virtual SMT mode
was fixed to the number of threads per subcore, and if userspace
wanted to have fewer vcpus per vcore, then it would achieve that by
using a sparse CPU numbering. This had the disadvantage that the
vcpu numbers can get quite large, particularly for SMT1 guests on
a POWER8 with 8 threads per core. With this patch, userspace can
set its desired virtual SMT mode and then use contiguous vcpu
numbering.

On POWER8, where the threading mode is "strict", the virtual SMT mode
must be less than or equal to the number of threads per subcore. On
POWER9, which implements a "loose" threading mode, the virtual SMT
mode can be any power of 2 between 1 and 8, even though there is
effectively one thread per subcore, since the threads are independent
and can all be in different partitions.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 5af50993 05-Apr-2017 Benjamin Herrenschmidt <benh@kernel.crashing.org>

KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller

This patch makes KVM capable of using the XIVE interrupt controller
to provide the standard PAPR "XICS" style hypercalls. It is necessary
for proper operations when the host uses XIVE natively.

This has been lightly tested on an actual system, including PCI
pass-through with a TG3 device.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Cleanup pr_xxx(), unsplit pr_xxx() strings, etc., fix build
failures by adding KVM_XIVE which depends on KVM_XICS and XIVE, and
adding empty stubs for the kvm_xive_xxx() routines, fixup subject,
integrate fixes from Paul for building PR=y HV=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 121f80ba 21-Mar-2017 Alexey Kardashevskiy <aik@ozlabs.ru>

KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
without passing them to user space which saves time on switching
to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in the real mode, if failed
it passes the request to the virtual mode to complete the operation.
If it a virtual mode handler fails, the request is passed to
the user space; this is not expected to happen though.

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.

If we fail to update a hardware IOMMU table unexpected reason, we just
clear it and move on as there is nothing really we can do about it -
for example, if we hot plug a VFIO device to a guest, existing TCE tables
will be mirrored automatically to the hardware and there is no interface
to report to the guest about possible failures.

This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is cached and referenced so we do not have to look up for it in real mode.

This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed; this adds necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This adds real mode version of WARN_ON_ONCE() as the generic version
causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
returns in the code, this also adds a check for already existing
vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# b1af23d8 21-Mar-2017 Alexey Kardashevskiy <aik@ozlabs.ru>

KVM: PPC: iommu: Unify TCE checking

This reworks helpers for checking TCE update parameters in way they
can be used in KVM.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 503bfcbe 21-Mar-2017 Alexey Kardashevskiy <aik@ozlabs.ru>

KVM: PPC: Pass kvm* to kvmppc_find_table()

The guest view TCE tables are per KVM anyway (not per VCPU) so pass kvm*
there. This will be used in the following patches where we will be
attaching VFIO containers to LIOBNs via ioctl() to KVM (rather than
to VCPU).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 6f63e81b 21-Feb-2017 Bin Lu <lblulb@linux.vnet.ibm.com>

KVM: PPC: Book3S: Add MMIO emulation for FP and VSX instructions

This patch provides the MMIO load/store emulation for instructions
of 'double & vector unsigned char & vector signed char & vector
unsigned short & vector signed short & vector unsigned int & vector
signed int & vector double '.

The instructions that this adds emulation for are:

- ldx, ldux, lwax,
- lfs, lfsx, lfsu, lfsux, lfd, lfdx, lfdu, lfdux,
- stfs, stfsx, stfsu, stfsux, stfd, stfdx, stfdu, stfdux, stfiwx,
- lxsdx, lxsspx, lxsiwax, lxsiwzx, lxvd2x, lxvw4x, lxvdsx,
- stxsdx, stxsspx, stxsiwx, stxvd2x, stxvw4x

[paulus@ozlabs.org - some cleanups, fixes and rework, make it
compile for Book E, fix build when PR KVM is built in]

Signed-off-by: Bin Lu <lblulb@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 307d9279 22-Mar-2017 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Provide functions for queueing up FP/VEC/VSX unavailable interrupts

This provides functions that can be used for generating interrupts
indicating that a given functional unit (floating point, vector, or
VSX) is unavailable. These functions will be used in instruction
emulation code.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# d381d7ca 05-Apr-2017 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Consolidate variants of real-mode MMIOs

We have all sort of variants of MMIO accessors for the real mode
instructions. This creates a clean set of accessors based on
Linux normal naming conventions, replacing all occurrences of
the old ones in the tree.

I have purposefully removed the "out/in" variants in favor of
only including __raw variants. Any code using these is already
pretty much hand tuned to operate in a very specific environment.
I've fixed up the 2 users (only one of them actually needed
a barrier in the first place).

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# f50d6bd3 05-Apr-2017 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/kvm: Remove obsolete kvm_vm_ioctl_xics_irq declaration

The function doesn't exist anymore

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 936774cd 05-Apr-2017 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/kvm: Make kvmppc_xics_create_icp static

It's only used within the same file it's defined

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 5e985969 19-Dec-2016 David Gibson <david@gibson.dropbear.id.au>

KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation

This adds a not yet working outline of the HPT resizing PAPR
extension. Specifically it adds the necessary ioctl() functions,
their basic steps, the work function which will handle preparation for
the resize, and synchronization between these, the guest page fault
path and guest HPT update path.

The actual guts of the implementation isn't here yet, so for now the
calls will always fail.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# f98a8bf9 19-Dec-2016 David Gibson <david@gibson.dropbear.id.au>

KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size

The KVM_PPC_ALLOCATE_HTAB ioctl() is used to set the size of hashed page
table (HPT) that userspace expects a guest VM to have, and is also used to
clear that HPT when necessary (e.g. guest reboot).

At present, once the ioctl() is called for the first time, the HPT size can
never be changed thereafter - it will be cleared but always sized as from
the first call.

With upcoming HPT resize implementation, we're going to need to allow
userspace to resize the HPT at reset (to change it back to the default size
if the guest changed it).

So, we need to allow this ioctl() to change the HPT size.

This patch also updates Documentation/virtual/kvm/api.txt to reflect
the new behaviour. In fact the documentation was already slightly
incorrect since 572abd5 "KVM: PPC: Book3S HV: Don't fall back to
smaller HPT size in allocation ioctl"

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# aae0777f 19-Dec-2016 David Gibson <david@gibson.dropbear.id.au>

KVM: PPC: Book3S HV: Split HPT allocation from activation

Currently, kvmppc_alloc_hpt() both allocates a new hashed page table (HPT)
and sets it up as the active page table for a VM. For the upcoming HPT
resize implementation we're going to want to allocate HPTs separately from
activating them.

So, split the allocation itself out into kvmppc_allocate_hpt() and perform
the activation with a new kvmppc_set_hpt() function. Likewise we split
kvmppc_free_hpt(), which just frees the HPT, from kvmppc_release_hpt()
which unsets it as an active HPT, then frees it.

We also move the logic to fall back to smaller HPT sizes if the first try
fails into the single caller which used that behaviour,
kvmppc_hv_setup_htab_rma(). This introduces a slight semantic change, in
that previously if the initial attempt at CMA allocation failed, we would
fall back to attempting smaller sizes with the page allocator. Now, we
try first CMA, then the page allocator at each size. As far as I can tell
this change should be harmless.

To match, we make kvmppc_free_hpt() just free the actual HPT itself. The
call to kvmppc_free_lpid() that was there, we move to the single caller.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# db9a290d 19-Dec-2016 David Gibson <david@gibson.dropbear.id.au>

KVM: PPC: Book3S HV: Rename kvm_alloc_hpt() for clarity

The difference between kvm_alloc_hpt() and kvmppc_alloc_hpt() is not at
all obvious from the name. In practice kvmppc_alloc_hpt() allocates an HPT
by whatever means, and calls kvm_alloc_hpt() which will attempt to allocate
it with CMA only.

To make this less confusing, rename kvm_alloc_hpt() to kvm_alloc_hpt_cma().
Similarly, kvm_release_hpt() is renamed kvm_free_hpt_cma().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# c9270132 30-Jan-2017 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Add userspace interfaces for POWER9 MMU

This adds two capabilities and two ioctls to allow userspace to
find out about and configure the POWER9 MMU in a guest. The two
capabilities tell userspace whether KVM can support a guest using
the radix MMU, or using the hashed page table (HPT) MMU with a
process table and segment tables. (Note that the MMUs in the
POWER9 processor cores do not use the process and segment tables
when in HPT mode, but the nest MMU does).

The KVM_PPC_CONFIGURE_V3_MMU ioctl allows userspace to specify
whether a guest will use the radix MMU or the HPT MMU, and to
specify the size and location (in guest space) of the process
table.

The KVM_PPC_GET_RMMU_INFO ioctl gives userspace information about
the radix MMU. It returns a list of supported radix tree geometries
(base page size and number of bits indexed at each level of the
radix tree) and the encoding used to specify the various page
sizes for the TLB invalidate entry instruction.

Initially, both capabilities return 0 and the ioctls return -EINVAL,
until the necessary infrastructure for them to operate correctly
is added.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# e34af784 30-Nov-2016 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S: Move prototypes for KVM functions into kvm_ppc.h

This moves the prototypes for functions that are only called from
assembler code out of asm/asm-prototypes.h into asm/kvm_ppc.h.
The prototypes were added in commit ebe4535fbe7a ("KVM: PPC:
Book3S HV: sparse: prototypes for functions called from assembler",
2016-10-10), but given that the functions are KVM functions,
having them in a KVM header will be better for long-term
maintenance.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# f725758b 17-Nov-2016 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9

POWER9 includes a new interrupt controller, called XIVE, which is
quite different from the XICS interrupt controller on POWER7 and
POWER8 machines. KVM-HV accesses the XICS directly in several places
in order to send and clear IPIs and handle interrupts from PCI
devices being passed through to the guest.

In order to make the transition to XIVE easier, OPAL firmware will
include an emulation of XICS on top of XIVE. Access to the emulated
XICS is via OPAL calls. The one complication is that the EOI
(end-of-interrupt) function can now return a value indicating that
another interrupt is pending; in this case, the XIVE will not signal
an interrupt in hardware to the CPU, and software is supposed to
acknowledge the new interrupt without waiting for another interrupt
to be delivered in hardware.

This adapts KVM-HV to use the OPAL calls on machines where there is
no XICS hardware. When there is no XICS, we look for a device-tree
node with "ibm,opal-intc" in its compatible property, which is how
OPAL indicates that it provides XICS emulation.

In order to handle the EOI return value, kvmppc_read_intr() has
become kvmppc_read_one_intr(), with a boolean variable passed by
reference which can be set by the EOI functions to indicate that
another interrupt is pending. The new kvmppc_read_intr() keeps
calling kvmppc_read_one_intr() until there are no more interrupts
to process. The return value from kvmppc_read_intr() is the
largest non-zero value of the returns from kvmppc_read_one_intr().

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 5d375199 18-Aug-2016 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Set server for passed-through interrupts

When a guest has a PCI pass-through device with an interrupt, it
will direct the interrupt to a particular guest VCPU. In fact the
physical interrupt might arrive on any CPU, and then get
delivered to the target VCPU in the emulated XICS (guest interrupt
controller), and eventually delivered to the target VCPU.

Now that we have code to handle device interrupts in real mode
without exiting to the host kernel, there is an advantage to having
the device interrupt arrive on the same sub(core) as the target
VCPU is running on. In this situation, the interrupt can be
delivered to the target VCPU without any exit to the host kernel
(using a hypervisor doorbell interrupt between threads if
necessary).

This patch aims to get passed-through device interrupts arriving
on the correct core by setting the interrupt server in the real
hardware XICS for the interrupt to the first thread in the (sub)core
where its target VCPU is running. We do this in the real-mode H_EOI
code because the H_EOI handler already needs to look at the
emulated ICS state for the interrupt (whereas the H_XIRR handler
doesn't), and we know we are running in the target VCPU context
at that point.

We set the server CPU in hardware using an OPAL call, regardless of
what the IRQ affinity mask for the interrupt says, and without
updating the affinity mask. This amounts to saying that when an
interrupt is passed through to a guest, as a matter of policy we
allow the guest's affinity for the interrupt to override the host's.

This is inspired by an earlier patch from Suresh Warrier, although
none of this code came from that earlier patch.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 644abbb2 18-Aug-2016 Suresh Warrier <warrier@linux.vnet.ibm.com>

KVM: PPC: Book3S HV: Tunable to disable KVM IRQ bypass

Add a module parameter kvm_irq_bypass for kvm_hv.ko to
disable IRQ bypass for passthrough interrupts. The default
value of this tunable is 1 - that is enable the feature.

Since the tunable is used by built-in kernel code, we use
the module_param_cb macro to achieve this.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# f7af5209 18-Aug-2016 Suresh Warrier <warrier@linux.vnet.ibm.com>

KVM: PPC: Book3S HV: Complete passthrough interrupt in host

In existing real mode ICP code, when updating the virtual ICP
state, if there is a required action that cannot be completely
handled in real mode, as for instance, a VCPU needs to be woken
up, flags are set in the ICP to indicate the required action.
This is checked when returning from hypercalls to decide whether
the call needs switch back to the host where the action can be
performed in virtual mode. Note that if h_ipi_redirect is enabled,
real mode code will first try to message a free host CPU to
complete this job instead of returning the host to do it ourselves.

Currently, the real mode PCI passthrough interrupt handling code
checks if any of these flags are set and simply returns to the host.
This is not good enough as the trap value (0x500) is treated as an
external interrupt by the host code. It is only when the trap value
is a hypercall that the host code searches for and acts on unfinished
work by calling kvmppc_xics_rm_complete.

This patch introduces a special trap BOOK3S_INTERRUPT_HV_RM_HARD
which is returned by KVM if there is unfinished business to be
completed in host virtual mode after handling a PCI passthrough
interrupt. The host checks for this special interrupt condition
and calls into the kvmppc_xics_rm_complete, which is made an
exported function for this reason.

[paulus@ozlabs.org - moved logic to set r12 to BOOK3S_INTERRUPT_HV_RM_HARD
in book3s_hv_rmhandlers.S into the end of kvmppc_check_wake_reason.]

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# e3c13e56 18-Aug-2016 Suresh Warrier <warrier@linux.vnet.ibm.com>

KVM: PPC: Book3S HV: Handle passthrough interrupts in guest

Currently, KVM switches back to the host to handle any external
interrupt (when the interrupt is received while running in the
guest). This patch updates real-mode KVM to check if an interrupt
is generated by a passthrough adapter that is owned by this guest.
If so, the real mode KVM will directly inject the corresponding
virtual interrupt to the guest VCPU's ICS and also EOI the interrupt
in hardware. In short, the interrupt is handled entirely in real
mode in the guest context without switching back to the host.

In some rare cases, the interrupt cannot be completely handled in
real mode, for instance, a VCPU that is sleeping needs to be woken
up. In this case, KVM simply switches back to the host with trap
reason set to 0x500. This works, but it is clearly not very efficient.
A following patch will distinguish this case and handle it
correctly in the host. Note that we can use the existing
check_too_hard() routine even though we are not in a hypercall to
determine if there is unfinished business that needs to be
completed in host virtual mode.

The patch assumes that the mapping between hardware interrupt IRQ
and virtual IRQ to be injected to the guest already exists for the
PCI passthrough interrupts that need to be handled in real mode.
If the mapping does not exist, KVM falls back to the default
existing behavior.

The KVM real mode code reads mappings from the mapped array in the
passthrough IRQ map without taking any lock. We carefully order the
loads and stores of the fields in the kvmppc_irq_map data structure
using memory barriers to avoid an inconsistent mapping being seen by
the reader. Thus, although it is possible to miss a map entry, it is
not possible to read a stale value.

[paulus@ozlabs.org - get irq_chip from irq_map rather than pimap,
pulled out powernv eoi change into a separate patch, made
kvmppc_read_intr get the vcpu from the paca rather than being
passed in, rewrote the logic at the end of kvmppc_read_intr to
avoid deep indentation, simplified logic in book3s_hv_rmhandlers.S
since we were always restoring SRR0/1 anyway, get rid of the cached
array (just use the mapped array), removed the kick_all_cpus_sync()
call, clear saved_xirr PACA field when we handle the interrupt in
real mode, fix compilation with CONFIG_KVM_XICS=n.]

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 8daaafc8 18-Aug-2016 Suresh Warrier <warrier@linux.vnet.ibm.com>

KVM: PPC: Book3S HV: Introduce kvmppc_passthru_irqmap

This patch introduces an IRQ mapping structure, the
kvmppc_passthru_irqmap structure that is to be used
to map the real hardware IRQ in the host with the virtual
hardware IRQ (gsi) that is injected into a guest by KVM for
passthrough adapters.

Currently, we assume a separate IRQ mapping structure for
each guest. Each kvmppc_passthru_irqmap has a mapping arrays,
containing all defined real<->virtual IRQs.

[paulus@ozlabs.org - removed irq_chip field from struct
kvmppc_passthru_irqmap; changed parameter for
kvmppc_get_passthru_irqmap from struct kvm_vcpu * to struct
kvm *, removed small cached array.]

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 9576730d 18-Aug-2016 Suresh Warrier <warrier@linux.vnet.ibm.com>

KVM: PPC: select IRQ_BYPASS_MANAGER

Select IRQ_BYPASS_MANAGER for PPC when CONFIG_KVM is set.
Add the PPC producer functions for add and del producer.

[paulus@ozlabs.org - Moved new functions from book3s.c to powerpc.c
so booke compiles; added kvm_arch_has_irq_bypass implementation.]

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 58ded420 29-Feb-2016 Alexey Kardashevskiy <aik@ozlabs.ru>

KVM: PPC: Add support for 64bit TCE windows

The existing KVM_CREATE_SPAPR_TCE only supports 32bit windows which is not
enough for directly mapped windows as the guest can get more than 4GB.

This adds KVM_CREATE_SPAPR_TCE_64 ioctl and advertises it
via KVM_CAP_SPAPR_TCE_64 capability. The table size is checked against
the locked memory limit.

Since 64bit windows are to support Dynamic DMA windows (DDW), let's add
@bus_offset and @page_shift which are also required by DDW.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 520fe9c6 21-Dec-2015 Suresh E. Warrier <warrier@linux.vnet.ibm.com>

KVM: PPC: Book3S HV: Add tunable to control H_IPI redirection

Redirecting the wakeup of a VCPU from the H_IPI hypercall to
a core running in the host is usually a good idea, most workloads
seemed to benefit. However, in one heavily interrupt-driven SMT1
workload, some regression was observed. This patch adds a kvm_hv
module parameter called h_ipi_redirect to control this feature.

The default value for this tunable is 1 - that is enable the feature.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 0c2a6606 17-Dec-2015 Suresh Warrier <warrier@linux.vnet.ibm.com>

KVM: PPC: Book3S HV: Host side kick VCPU when poked by real-mode KVM

This patch adds the support for the kick VCPU operation for
kvmppc_host_rm_ops. The kvmppc_xics_ipi_action() function
provides the function to be invoked for a host side operation
when poked by the real mode KVM. This is initiated by KVM by
sending an IPI to any free host core.

KVM real mode must set the rm_action to XICS_RM_KICK_VCPU and
rm_data to point to the VCPU to be woken up before sending the IPI.
Note that we have allocated one kvmppc_host_rm_core structure
per core. The above values need to be set in the structure
corresponding to the core to which the IPI will be sent.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 79b6c247 17-Dec-2015 Suresh Warrier <warrier@linux.vnet.ibm.com>

KVM: PPC: Book3S HV: Host-side RM data structures

This patch defines the data structures to support the setting up
of host side operations while running in real mode in the guest,
and also the functions to allocate and free it.

The operations are for now limited to virtual XICS operations.
Currently, we have only defined one operation in the data
structure:
- Wake up a VCPU sleeping in the host when it
receives a virtual interrupt

The operations are assigned at the core level because PowerKVM
requires that the host run in SMT off mode. For each core,
we will need to manage its state atomically - where the state
is defined by:
1. Is the core running in the host?
2. Is there a Real Mode (RM) operation pending on the host?

Currently, core state is only managed at the whole-core level
even when the system is in split-core mode. This just limits
the number of free or "available" cores in the host to perform
any host-side operations.

The kvmppc_host_rm_core.rm_data allows any data to be passed by
KVM in real mode to the host core along with the operation to
be performed.

The kvmppc_host_rm_ops structure is allocated the very first time
a guest VM is started. Initial core state is also set - all online
cores are in the host. This structure is never deleted, not even
when there are no active guests. However, it needs to be freed
when the module is unloaded because the kvmppc_host_rm_ops_hv
can contain function pointers to kvm-hv.ko functions for the
different supported host operations.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# d3695aa4 14-Feb-2016 Alexey Kardashevskiy <aik@ozlabs.ru>

KVM: PPC: Add support for multiple-TCE hcalls

This adds real and virtual mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO
devices or emulated PCI. These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition between kernel and user space.

The current implementation of kvmppc_h_stuff_tce() allows it to be
executed in both real and virtual modes so there is one helper.
The kvmppc_rm_h_put_tce_indirect() needs to translate the guest address
to the host address and since the translation is different, there are
2 helpers - one for each mode.

This implements the KVM_CAP_PPC_MULTITCE capability. When present,
the kernel will try handling H_PUT_TCE_INDIRECT and H_STUFF_TCE if these
are enabled by the userspace via KVM_CAP_PPC_ENABLE_HCALL.
If they can not be handled by the kernel, they are passed on to
the user space. The user space still has to have an implementation
for these.

Both HV and PR-syle KVM are supported.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 5ee7af18 14-Feb-2016 Alexey Kardashevskiy <aik@ozlabs.ru>

KVM: PPC: Move reusable bits of H_PUT_TCE handler to helpers

Upcoming multi-tce support (H_PUT_TCE_INDIRECT/H_STUFF_TCE hypercalls)
will validate TCE (not to have unexpected bits) and IO address
(to be within the DMA window boundaries).

This introduces helpers to validate TCE and IO address. The helpers are
exported as they compile into vmlinux (to work in realmode) and will be
used later by KVM kernel module in virtual mode.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# ba049e93 15-Jan-2016 Dan Williams <dan.j.williams@intel.com>

kvm: rename pfn_t to kvm_pfn_t

To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.

The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.

The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.

Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.

This patch (of 18):

The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f36f3f28 18-May-2015 Paolo Bonzini <pbonzini@redhat.com>

KVM: add "new" argument to kvm_arch_commit_memory_region

This lets the function access the new memory slot without going through
kvm_memslots and id_to_memslot. It will simplify the code when more
than one address space will be supported.

Unfortunately, the "const"ness of the new argument must be casted
away in two places. Fixing KVM to accept const struct kvm_memory_slot
pointers would require modifications in pretty much all architectures,
and is left for later.

Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 09170a49 18-May-2015 Paolo Bonzini <pbonzini@redhat.com>

KVM: const-ify uses of struct kvm_userspace_memory_region

Architecture-specific helpers are not supposed to muck with
struct kvm_userspace_memory_region contents. Add const to
enforce this.

In order to eliminate the only write in __kvm_set_memory_region,
the cleaning of deleted slots is pulled up from update_memslots
to __kvm_set_memory_region.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# e928e9cb 20-Mar-2015 Michael Ellerman <michael@ellerman.id.au>

KVM: PPC: Book3S HV: Add fast real-mode H_RANDOM implementation.

Some PowerNV systems include a hardware random-number generator.
This HWRNG is present on POWER7+ and POWER8 chips and is capable of
generating one 64-bit random number every microsecond. The random
numbers are produced by sampling a set of 64 unstable high-frequency
oscillators and are almost completely entropic.

PAPR defines an H_RANDOM hypercall which guests can use to obtain one
64-bit random sample from the HWRNG. This adds a real-mode
implementation of the H_RANDOM hypercall. This hypercall was
implemented in real mode because the latency of reading the HWRNG is
generally small compared to the latency of a guest exit and entry for
all the threads in the same virtual core.

Userspace can detect the presence of the HWRNG and the H_RANDOM
implementation by querying the KVM_CAP_PPC_HWRNG capability. The
H_RANDOM hypercall implementation will only be invoked when the guest
does an H_RANDOM hypercall if userspace first enables the in-kernel
H_RANDOM implementation using the KVM_CAP_PPC_ENABLE_HCALL capability.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# c17b98cf 02-Dec-2014 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S HV: Remove code for PPC970 processors

This removes the code that was added to enable HV KVM to work
on PPC970 processors. The PPC970 is an old CPU that doesn't
support virtualizing guest memory. Removing PPC970 support also
lets us remove the code for allocating and managing contiguous
real-mode areas, the code for the !kvm->arch.using_mmu_notifiers
case, the code for pinning pages of guest memory when first
accessed and keeping track of which pages have been pinned, and
the code for handling H_ENTER hypercalls in virtual mode.

Book3S HV KVM is now supported only on POWER7 and POWER8 processors.
The KVM_CAP_PPC_RMA capability now always returns 0.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 57128468 22-Sep-2014 Andres Lagar-Cavilla <andreslc@google.com>

kvm: Fix page ageing bugs

1. We were calling clear_flush_young_notify in unmap_one, but we are
within an mmu notifier invalidate range scope. The spte exists no more
(due to range_start) and the accessed bit info has already been
propagated (due to kvm_pfn_set_accessed). Simply call
clear_flush_young.

2. We clear_flush_young on a primary MMU PMD, but this may be mapped
as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
This required expanding the interface of the clear_flush_young mmu
notifier, so a lot of code has been trivially touched.

3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
the access bit by blowing the spte. This requires proper synchronizing
with MMU notifier consumers, like every other removal of spte's does.

Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# a59c1d9e 09-Sep-2014 Madhavan Srinivasan <maddy@linux.vnet.ibm.com>

powerpc/kvm: support to handle sw breakpoint

This patch adds kernel side support for software breakpoint.
Design is that, by using an illegal instruction, we trap to hypervisor
via Emulation Assistance interrupt, where we check for the illegal instruction
and accordingly we return to Host or Guest. Patch also adds support for
software breakpoint in PR KVM.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# d02d4d15 01-Sep-2014 Mihai Caraman <mihai.caraman@freescale.com>

KVM: PPC: Remove the tasklet used by the hrtimer

Powerpc timer implementation is a copycat version of s390. Now that they removed
the tasklet with commit ea74c0ea1b24a6978a6ebc80ba4dbc7b7848b32d follow this
optimization.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 2f699a59 13-Aug-2014 Bharat Bhushan <Bharat.Bhushan@freescale.com>

KVM: PPC: BOOKE: Emulate debug registers and exception

This patch emulates debug registers and debug exception
to support guest using debug resource. This enables running
gdb/kgdb etc in guest.

On BOOKE architecture we cannot share debug resources between QEMU and
guest because:
When QEMU is using debug resources then debug exception must
be always enabled. To achieve this we set MSR_DE and also set
MSRP_DEP so guest cannot change MSR_DE.

When emulating debug resource for guest we want guest
to control MSR_DE (enable/disable debug interrupt on need).

So above mentioned two configuration cannot be supported
at the same time. So the result is that we cannot share
debug resources between QEMU and Guest on BOOKE architecture.

In the current design QEMU gets priority over guest, this means that if
QEMU is using debug resources then guest cannot use them and if guest is
using debug resource then QEMU can overwrite them.

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 5a484c7c 30-Jul-2014 Bharat Bhushan <Bharat.Bhushan@freescale.com>

KVM: PPC: BOOKEHV: rename e500hv_spr to bookehv_spr

This are not specific to e500hv but applicable for bookehv
(As per comment from Scott Wood on my patch
"kvm: ppc: bookehv: Added wrapper macros for shadow registers")

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# ce91ddc4 28-Jul-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Remove DCR handling

DCR handling was only needed for 440 KVM. Since we removed it, we can also
remove handling of DCR accesses.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 8de12015 18-Jun-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Expose helper functions for data/inst faults

We're going to implement guest code interpretation in KVM for some rare
corner cases. This code needs to be able to inject data and instruction
faults into the guest when it encounters them.

Expose generic APIs to do this in a reasonably subarch agnostic fashion.

Signed-off-by: Alexander Graf <agraf@suse.de>


# d69614a2 18-Jun-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Separate loadstore emulation from priv emulation

Today the instruction emulator can get called via 2 separate code paths. It
can either be called by MMIO emulation detection code or by privileged
instruction traps.

This is bad, as both code paths prepare the environment differently. For MMIO
emulation we already know the virtual address we faulted on, so instructions
there don't have to actually fetch that information.

Split out the two separate use cases into separate files.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 35c4a733 20-Jun-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Move kvmppc_ld/st to common code

We have enough common infrastructure now to resolve GVA->GPA mappings at
runtime. With this we can move our book3s specific helpers to load / store
in guest virtual address space to common code as well.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 7d15c06f 20-Jun-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Implement kvmppc_xlate for all targets

We have a nice API to find the translated GPAs of a GVA including protection
flags. So far we only use it on Book3S, but there's no reason the same shouldn't
be used on BookE as well.

Implement a kvmppc_xlate() version for BookE and clean it up to make it more
readable in general.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 51f04726 23-Jul-2014 Mihai Caraman <mihai.caraman@freescale.com>

KVM: PPC: Allow kvmppc_get_last_inst() to fail

On book3e, guest last instruction is read on the exit path using load
external pid (lwepx) dedicated instruction. This load operation may fail
due to TLB eviction and execute-but-not-read entries.

This patch lay down the path for an alternative solution to read the guest
last instruction, by allowing kvmppc_get_lat_inst() function to fail.
Architecture specific implmentations of kvmppc_load_last_inst() may read
last guest instruction and instruct the emulation layer to re-execute the
guest in case of failure.

Make kvmppc_get_last_inst() definition common between architectures.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 34f754b9 17-Jul-2014 Bharat Bhushan <Bharat.Bhushan@freescale.com>

kvm: ppc: Add SPRN_EPR get helper function

kvmppc_set_epr() is already defined in asm/kvm_ppc.h, So
rename and move get_epr helper function to same file.

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
[agraf: remove duplicate return]
Signed-off-by: Alexander Graf <agraf@suse.de>


# dc168549 17-Jul-2014 Bharat Bhushan <Bharat.Bhushan@freescale.com>

kvm: ppc: booke: Add shared struct helpers of SPRN_ESR

Add and use kvmppc_set_esr() and kvmppc_get_esr() helper functions

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 1dc0c5b8 17-Jul-2014 Bharat Bhushan <Bharat.Bhushan@freescale.com>

kvm: ppc: bookehv: Added wrapper macros for shadow registers

There are shadow registers like, GSPRG[0-3], GSRR0, GSRR1 etc on
BOOKE-HV and these shadow registers are guest accessible.
So these shadow registers needs to be updated on BOOKE-HV.
This patch adds new macro for get/set helper of shadow register .

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# ae2113a4 01-Jun-2014 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled

This adds code to check that when the KVM_CAP_PPC_ENABLE_HCALL
capability is used to enable or disable in-kernel handling of an
hcall, that the hcall is actually implemented by the kernel.
If not an EINVAL error is returned.

This also checks the default-enabled list of hcalls and prints a
warning if any hcall there is not actually implemented.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 5deb8e7a 24-Apr-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Make shared struct aka magic page guest endian

The shared (magic) page is a data structure that contains often used
supervisor privileged SPRs accessible via memory to the user to reduce
the number of exits we have to take to read/write them.

When we actually share this structure with the guest we have to maintain
it in guest endianness, because some of the patch tricks only work with
native endian load/store operations.

Since we only share the structure with either host or guest in little
endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv.

For booke, the shared struct stays big endian. For book3s_64 hv we maintain
the struct in host native endian, since it never gets shared with the guest.

For book3s_64 pr we introduce a variable that tells us which endianness the
shared struct is in and route every access to it through helper inline
functions that evaluate this variable.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 441c19c8 23-May-2014 Michael Ellerman <mpe@ellerman.id.au>

powerpc/kvm/book3s_hv: Rework the secondary inhibit code

As part of the support for split core on POWER8, we want to be able to
block splitting of the core while KVM VMs are active.

The logic to do that would be exactly the same as the code we currently
have for inhibiting onlining of secondaries.

Instead of adding an identical mechanism to block split core, rework the
secondary inhibit code to be a "HV KVM is active" check. We can then use
that in both the cpu hotplug code and the upcoming split core code.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Alexander Graf <agraf@suse.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 69e9fbb2 21-Feb-2014 Laurent Dufour <ldufour@linux.vnet.ibm.com>

KVM: PPC: Book3S: Introduce hypervisor call H_GET_TCE

This introduces the H_GET_TCE hypervisor call, which is basically the
reverse of H_PUT_TCE, as defined in the Power Architecture Platform
Requirements (PAPR).

The hcall H_GET_TCE is required by the kdump kernel, which uses it to
retrieve TCEs set up by the previous (panicked) kernel.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 6c85f52b 09-Jan-2014 Scott Wood <scottwood@freescale.com>

kvm/ppc: IRQ disabling cleanup

Simplify the handling of lazy EE by going directly from fully-enabled
to hard-disabled. This replaces the lazy_irq_pending() check
(including its misplaced kvm_guest_exit() call).

As suggested by Tiejun Chen, move the interrupt disabling into
kvmppc_prepare_to_enter() rather than have each caller do it. Also
move the IRQ enabling on heavyweight exit into
kvmppc_prepare_to_enter().

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 73601775 09-Jan-2014 Cédric Le Goater <clg@fr.ibm.com>

KVM: PPC: Book3S: MMIO emulation support for little endian guests

MMIO emulation reads the last instruction executed by the guest
and then emulates. If the guest is running in Little Endian order,
or more generally in a different endian order of the host, the
instruction needs to be byte-swapped before being emulated.

This patch adds a helper routine which tests the endian order of
the host and the guest in order to decide whether a byteswap is
needed or not. It is then used to byteswap the last instruction
of the guest in the endian order of the host before MMIO emulation
is performed.

Finally, kvmppc_handle_load() of kvmppc_handle_store() are modified
to reverse the endianness of the MMIO if required.

Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
[agraf: add booke handling]
Signed-off-by: Alexander Graf <agraf@suse.de>


# a78b55d1 07-Oct-2013 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

kvm: powerpc: book3s: drop is_hv_enabled

drop is_hv_enabled, because that should not be a callback property

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# cbbc58d4 07-Oct-2013 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

kvm: powerpc: book3s: Allow the HV and PR selection per virtual machine

This moves the kvmppc_ops callbacks to be a per VM entity. This
enables us to select HV and PR mode when creating a VM. We also
allow both kvm-hv and kvm-pr kernel module to be loaded. To
achieve this we move /dev/kvm ownership to kvm.ko module. Depending on
which KVM mode we select during VM creation we take a reference
count on respective module

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[agraf: fix coding style]
Signed-off-by: Alexander Graf <agraf@suse.de>


# 5587027c 07-Oct-2013 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

kvm: Add struct kvm arg to memslot APIs

We will use that in the later patch to find the kvm ops handler

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 699cc876 07-Oct-2013 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

kvm: powerpc: book3s: Add is_hv_enabled to kvmppc_ops

This help us to identify whether we are running with hypervisor mode KVM
enabled. The change is needed so that we can have both HV and PR kvm
enabled in the same kernel.

If both HV and PR KVM are included, interrupts come in to the HV version
of the kvmppc_interrupt code, which then jumps to the PR handler,
renamed to kvmppc_interrupt_pr, if the guest is a PR guest.

Allowing both PR and HV in the same kernel required some changes to
kvm_dev_ioctl_check_extension(), since the values returned now can't
be selected with #ifdefs as much as previously. We look at is_hv_enabled
to return the right value when checking for capabilities.For capabilities that
are only provided by HV KVM, we return the HV value only if
is_hv_enabled is true. For capabilities provided by PR KVM but not HV,
we return the PR value only if is_hv_enabled is false.

NOTE: in later patch we replace is_hv_enabled with a static inline
function comparing kvm_ppc_ops

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 3a167bea 07-Oct-2013 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

kvm: powerpc: Add kvmppc_ops callback

This patch add a new callback kvmppc_ops. This will help us in enabling
both HV and PR KVM together in the same kernel. The actual change to
enable them together is done in the later patch in the series.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[agraf: squash in booke changes]
Signed-off-by: Alexander Graf <agraf@suse.de>


# 9975f5e3 07-Oct-2013 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

kvm: powerpc: book3s: Add a new config variable CONFIG_KVM_BOOK3S_HV_POSSIBLE

This help ups to select the relevant code in the kernel code
when we later move HV and PR bits as seperate modules. The patch
also makes the config options for PR KVM selectable

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 5f1c248f 10-Jul-2013 Scott Wood <scottwood@freescale.com>

kvm/ppc: Call trace_hardirqs_on before entry

Currently this is only being done on 64-bit. Rather than just move it
out of the 64-bit ifdef, move it to kvm_lazy_ee_enable() so that it is
consistent with lazy ee state, and so that we don't track more host
code as interrupts-enabled than necessary.

Rename kvm_lazy_ee_enable() to kvm_fix_ee_before_entry() to reflect
that this function now has a role on 32-bit as well.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 6c45b810 01-Jul-2013 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/kvm: Contiguous memory allocator based RMA allocation

Older version of power architecture use Real Mode Offset register and Real Mode Limit
Selector for mapping guest Real Mode Area. The guest RMA should be physically
contigous since we use the range when address translation is not enabled.

This patch switch RMA allocation code to use contigous memory allocator. The patch
also remove the the linear allocator which not used any more

Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# fa61a4e3 01-Jul-2013 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

powerpc/kvm: Contiguous memory allocator based hash page table allocation

Powerpc architecture uses a hash based page table mechanism for mapping virtual
addresses to physical address. The architecture require this hash page table to
be physically contiguous. With KVM on Powerpc currently we use early reservation
mechanism for allocating guest hash page table. This implies that we need to
reserve a big memory region to ensure we can create large number of guest
simultaneously with KVM on Power. Another disadvantage is that the reserved memory
is not available to rest of the subsystems and and that implies we limit the total
available memory in the host.

This patch series switch the guest hash page table allocation to use
contiguous memory allocator.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 5975a2e0 26-Apr-2013 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S: Add API for in-kernel XICS emulation

This adds the API for userspace to instantiate an XICS device in a VM
and connect VCPUs to it. The API consists of a new device type for
the KVM_CREATE_DEVICE ioctl, a new capability KVM_CAP_IRQ_XICS, which
functions similarly to KVM_CAP_IRQ_MPIC, and the KVM_IRQ_LINE ioctl,
which is used to assert and deassert interrupt inputs of the XICS.

The XICS device has one attribute group, KVM_DEV_XICS_GRP_SOURCES.
Each attribute within this group corresponds to the state of one
interrupt source. The attribute number is the same as the interrupt
source number.

This does not support irq routing or irqfd yet.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 8b78645c 17-Apr-2013 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S: Facilities to save/restore XICS presentation ctrler state

This adds the ability for userspace to save and restore the state
of the XICS interrupt presentation controllers (ICPs) via the
KVM_GET/SET_ONE_REG interface. Since there is one ICP per vcpu, we
simply define a new 64-bit register in the ONE_REG space for the ICP
state. The state includes the CPU priority setting, the pending IPI
priority, and the priority and source number of any pending external
interrupt.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# d19bd862 17-Apr-2013 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S: Add support for ibm,int-on/off RTAS calls

This adds support for the ibm,int-on and ibm,int-off RTAS calls to the
in-kernel XICS emulation and corrects the handling of the saved
priority by the ibm,set-xive RTAS call. With this, ibm,int-off sets
the specified interrupt's priority in its saved_priority field and
sets the priority to 0xff (the least favoured value). ibm,int-on
restores the saved_priority to the priority field, and ibm,set-xive
sets both the priority and the saved_priority to the specified
priority value.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 54695c30 17-Apr-2013 Benjamin Herrenschmidt <benh@kernel.crashing.org>

KVM: PPC: Book3S HV: Speed up wakeups of CPUs on HV KVM

Currently, we wake up a CPU by sending a host IPI with
smp_send_reschedule() to thread 0 of that core, which will take all
threads out of the guest, and cause them to re-evaluate their
interrupt status on the way back in.

This adds a mechanism to differentiate real host IPIs from IPIs sent
by KVM for guest threads to poke each other, in order to target the
guest threads precisely when possible and avoid that global switch of
the core to host state.

We then use this new facility in the in-kernel XICS code.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# bc5ad3f3 17-Apr-2013 Benjamin Herrenschmidt <benh@kernel.crashing.org>

KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

This adds in-kernel emulation of the XICS (eXternal Interrupt
Controller Specification) interrupt controller specified by PAPR, for
both HV and PR KVM guests.

The XICS emulation supports up to 1048560 interrupt sources.
Interrupt source numbers below 16 are reserved; 0 is used to mean no
interrupt and 2 is used for IPIs. Internally these are represented in
blocks of 1024, called ICS (interrupt controller source) entities, but
that is not visible to userspace.

Each vcpu gets one ICP (interrupt controller presentation) entity,
used to store the per-vcpu state such as vcpu priority, pending
interrupt state, IPI request, etc.

This does not include any API or any way to connect vcpus to their
ICP state; that will be added in later patches.

This is based on an initial implementation by Michael Ellerman
<michael@ellerman.id.au> reworked by Benjamin Herrenschmidt and
Paul Mackerras.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix typo, add dependency on !KVM_MPIC]
Signed-off-by: Alexander Graf <agraf@suse.de>


# 8e591cb7 17-Apr-2013 Michael Ellerman <michael@ellerman.id.au>

KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls

For pseries machine emulation, in order to move the interrupt
controller code to the kernel, we need to intercept some RTAS
calls in the kernel itself. This adds an infrastructure to allow
in-kernel handlers to be registered for RTAS services by name.
A new ioctl, KVM_PPC_RTAS_DEFINE_TOKEN, then allows userspace to
associate token values with those service names. Then, when the
guest requests an RTAS service with one of those token values, it
will be handled by the relevant in-kernel handler rather than being
passed up to userspace as at present.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix warning]
Signed-off-by: Alexander Graf <agraf@suse.de>


# eb1e4f43 12-Apr-2013 Scott Wood <scottwood@freescale.com>

kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC

Enabling this capability connects the vcpu to the designated in-kernel
MPIC. Using explicit connections between vcpus and irqchips allows
for flexibility, but the main benefit at the moment is that it
simplifies the code -- KVM doesn't need vm-global state to remember
which MPIC object is associated with this vm, and it doesn't need to
care about ordering between irqchip creation and vcpu creation.

Signed-off-by: Scott Wood <scottwood@freescale.com>
[agraf: add stub functions for kvmppc_mpic_{dis,}connect_vcpu]
Signed-off-by: Alexander Graf <agraf@suse.de>


# 5df554ad 12-Apr-2013 Scott Wood <scottwood@freescale.com>

kvm/ppc/mpic: in-kernel MPIC emulation

Hook the MPIC code up to the KVM interfaces, add locking, etc.

Signed-off-by: Scott Wood <scottwood@freescale.com>
[agraf: add stub function for kvmppc_mpic_set_epr, non-booke, 64bit]
Signed-off-by: Alexander Graf <agraf@suse.de>


# c402a3f4 07-Apr-2013 Bharat Bhushan <r65777@freescale.com>

Rename EMULATE_DO_PAPR to EMULATE_EXIT_USER

Instruction emulation return EMULATE_DO_PAPR when it requires
exit to userspace on book3s. Similar return is required
for booke. EMULATE_DO_PAPR reads out to be confusing so it is
renamed to EMULATE_EXIT_USER.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# adccf65c 25-Apr-2013 Bharat Bhushan <Bharat.Bhushan@freescale.com>

KVM: PPC: cache flush for kernel managed pages

Kernel can only access pages which maps as memory.
So flush only the valid kernel pages.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 4fe27d2a 14-Feb-2013 Paul Mackerras <paulus@samba.org>

KVM: PPC: Remove unused argument to kvmppc_core_dequeue_external

Currently kvmppc_core_dequeue_external() takes a struct kvm_interrupt *
argument and does nothing with it, in any of its implementations.
This removes it in order to make things easier for forthcoming
in-kernel interrupt controller emulation code.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 8482644a 27-Feb-2013 Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>

KVM: set_memory_region: Refactor commit_memory_region()

This patch makes the parameter old a const pointer to the old memory
slot and adds a new parameter named change to know the change being
requested: the former is for removing extra copying and the latter is
for cleaning up the code.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 1d542d9c 15-Jan-2013 Bharat Bhushan <Bharat.Bhushan@freescale.com>

KVM: PPC: booke: Allow multiple exception types

Current kvmppc_booke_handlers uses the same macro (KVM_HANDLER) and
all handlers are considered to be the same size. This will not be
the case if we want to use different macros for different handlers.

This patch improves the kvmppc_booke_handler so that it can
support different macros for different handlers.

Signed-off-by: Liu Yu <yu.liu@freescale.com>
[bharat.bhushan@freescale.com: Substantial changes]
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 1c810636 04-Jan-2013 Alexander Graf <agraf@suse.de>

KVM: PPC: BookE: Implement EPR exit

The External Proxy Facility in FSL BookE chips allows the interrupt
controller to automatically acknowledge an interrupt as soon as a
core gets its pending external interrupt delivered.

Today, user space implements the interrupt controller, so we need to
check on it during such a cycle.

This patch implements logic for user space to enable EPR exiting,
disable EPR exiting and EPR exiting itself, so that user space can
acknowledge an interrupt when an external interrupt has successfully
been delivered into the guest vcpu.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 50c7bb80 14-Dec-2012 Alexander Graf <agraf@suse.de>

KVM: PPC: Book3S: PR: Enable alternative instruction for SC 1

When running on top of pHyp, the hypercall instruction "sc 1" goes
straight into pHyp without trapping in supervisor mode.

So if we want to support PAPR guest in this configuration we need to
add a second way of accessing PAPR hypercalls, preferably with the
exact same semantics except for the instruction.

So let's overlay an officially reserved instruction and emulate PAPR
hypercalls whenever we hit that one.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 8823a8fd 11-Oct-2012 Mihai Caraman <mihai.caraman@freescale.com>

KVM: PPC: Mask ea's high 32-bits in 32/64 instr emulation

Mask high 32 bits of effective address in emulation layer for guests running
in 32-bit mode.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
[agraf: fix indent]
Signed-off-by: Alexander Graf <agraf@suse.de>


# 7cdd7a95 11-Oct-2012 Mihai Caraman <mihai.caraman@freescale.com>

KVM: PPC: e500: Add emulation helper for getting instruction ea

Add emulation helper for getting instruction ea and refactor tlb instruction
emulation to use it.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
[agraf: keep rt variable around]
Signed-off-by: Alexander Graf <agraf@suse.de>


# a2932923 19-Nov-2012 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor. Reads on
this fd return the contents of the HPT (hashed page table), writes
create and/or remove entries in the HPT. There is a new capability,
KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl. The ioctl
takes an argument structure with the index of the first HPT entry to
read out and a set of flags. The flags indicate whether the user is
intending to read or write the HPT, and whether to return all entries
or only the "bolted" entries (those with the bolted bit, 0x10, set in
the first doubleword).

This is intended for use in implementing qemu's savevm/loadvm and for
live migration. Therefore, on reads, the first pass returns information
about all HPTEs (or all bolted HPTEs). When the first pass reaches the
end of the HPT, it returns from the read. Subsequent reads only return
information about HPTEs that have changed since they were last read.
A read that finds no changed HPTEs in the HPT following where the last
read finished will return 0 bytes.

The format of the data provides a simple run-length compression of the
invalid entries. Each block of data starts with a header that indicates
the index (position in the HPT, which is just an array), the number of
valid entries starting at that index (may be zero), and the number of
invalid entries following those valid entries. The valid entries, 16
bytes each, follow the header. The invalid entries are not explicitly
represented.

Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix documentation]
Signed-off-by: Alexander Graf <agraf@suse.de>


# 55b665b0 25-Sep-2012 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S HV: Provide a way for userspace to get/set per-vCPU areas

The PAPR paravirtualization interface lets guests register three
different types of per-vCPU buffer areas in its memory for communication
with the hypervisor. These are called virtual processor areas (VPAs).
Currently the hypercalls to register and unregister VPAs are handled
by KVM in the kernel, and userspace has no way to know about or save
and restore these registrations across a migration.

This adds "register" codes for these three areas that userspace can
use with the KVM_GET/SET_ONE_REG ioctls to see what addresses have
been registered, and to register or unregister them. This will be
needed for guest hibernation and migration, and is also needed so
that userspace can unregister them on reset (otherwise we corrupt
guest memory after reboot by writing to the VPAs registered by the
previous kernel).

The "register" for the VPA is a 64-bit value containing the address,
since the length of the VPA is fixed. The "registers" for the SLB
shadow buffer and dispatch trace log (DTL) are 128 bits long,
consisting of the guest physical address in the high (first) 64 bits
and the length in the low 64 bits.

This also fixes a bug where we were calling init_vpa unconditionally,
leading to an oops when unregistering the VPA.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# a8bd19ef 25-Sep-2012 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S: Get/set guest FP regs using the GET/SET_ONE_REG interface

This enables userspace to get and set all the guest floating-point
state using the KVM_[GS]ET_ONE_REG ioctls. The floating-point state
includes all of the traditional floating-point registers and the
FPSCR (floating point status/control register), all the VMX/Altivec
vector registers and the VSCR (vector status/control register), and
on POWER7, the vector-scalar registers (note that each FP register
is the high-order half of the corresponding VSR).

Most of these are implemented in common Book 3S code, except for VSX
on POWER7. Because HV and PR differ in how they store the FP and VSX
registers on POWER7, the code for these cases is not common. On POWER7,
the FP registers are the upper halves of the VSX registers vsr0 - vsr31.
PR KVM stores vsr0 - vsr31 in two halves, with the upper halves in the
arch.fpr[] array and the lower halves in the arch.vsr[] array, whereas
HV KVM on POWER7 stores the whole VSX register in arch.vsr[].

Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix whitespace, vsx compilation]
Signed-off-by: Alexander Graf <agraf@suse.de>


# a136a8bd 25-Sep-2012 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S: Get/set guest SPRs using the GET/SET_ONE_REG interface

This enables userspace to get and set various SPRs (special-purpose
registers) using the KVM_[GS]ET_ONE_REG ioctls. With this, userspace
can get and set all the SPRs that are part of the guest state, either
through the KVM_[GS]ET_REGS ioctls, the KVM_[GS]ET_SREGS ioctls, or
the KVM_[GS]ET_ONE_REG ioctls.

The SPRs that are added here are:

- DABR: Data address breakpoint register
- DSCR: Data stream control register
- PURR: Processor utilization of resources register
- SPURR: Scaled PURR
- DAR: Data address register
- DSISR: Data storage interrupt status register
- AMR: Authority mask register
- UAMOR: User authority mask override register
- MMCR0, MMCR1, MMCRA: Performance monitor unit control registers
- PMC1..PMC8: Performance monitor unit counter registers

In order to reduce code duplication between PR and HV KVM code, this
moves the kvm_vcpu_ioctl_[gs]et_one_reg functions into book3s.c and
centralizes the copying between user and kernel space there. The
registers that are handled differently between PR and HV, and those
that exist only in one flavor, are handled in kvmppc_[gs]et_one_reg()
functions that are specific to each flavor.

Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: minimal style fixes]
Signed-off-by: Alexander Graf <agraf@suse.de>


# dfe49dbd 11-Sep-2012 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S HV: Handle memory slot deletion and modification correctly

This adds an implementation of kvm_arch_flush_shadow_memslot for
Book3S HV, and arranges for kvmppc_core_commit_memory_region to
flush the dirty log when modifying an existing slot. With this,
we can handle deletion and modification of memory slots.

kvm_arch_flush_shadow_memslot calls kvmppc_core_flush_memslot, which
on Book3S HV now traverses the reverse map chains to remove any HPT
(hashed page table) entries referring to pages in the memslot. This
gets called by generic code whenever deleting a memslot or changing
the guest physical address for a memslot.

We flush the dirty log in kvmppc_core_commit_memory_region for
consistency with what x86 does. We only need to flush when an
existing memslot is being modified, because for a new memslot the
rmap array (which stores the dirty bits) is all zero, meaning that
every page is considered clean already, and when deleting a memslot
we obviously don't care about the dirty bits any more.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# a66b48c3 11-Sep-2012 Paul Mackerras <paulus@samba.org>

KVM: PPC: Move kvm->arch.slot_phys into memslot.arch

Now that we have an architecture-specific field in the kvm_memory_slot
structure, we can use it to store the array of page physical addresses
that we need for Book3S HV KVM on PPC970 processors. This reduces the
size of struct kvm_arch for Book3S HV, and also reduces the size of
struct kvm_arch_memory_slot for other PPC KVM variants since the fields
in it are now only compiled in for Book3S HV.

This necessitates making the kvm_arch_create_memslot and
kvm_arch_free_memslot operations specific to each PPC KVM variant.
That in turn means that we now don't allocate the rmap arrays on
Book3S PR and Book E.

Since we now unpin pages and free the slot_phys array in
kvmppc_core_free_memslot, we no longer need to do it in
kvmppc_core_destroy_vm, since the generic code takes care to free
all the memslots when destroying a VM.

We now need the new memslot to be passed in to
kvmppc_core_prepare_memory_region, since we need to initialize its
arch.slot_phys member on Book3S HV.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# f61c94bb 08-Aug-2012 Bharat Bhushan <r65777@freescale.com>

KVM: PPC: booke: Add watchdog emulation

This patch adds the watchdog emulation in KVM. The watchdog
emulation is enabled by KVM_ENABLE_CAP(KVM_CAP_PPC_BOOKE_WATCHDOG) ioctl.
The kernel timer are used for watchdog emulation and emulates
h/w watchdog state machine. On watchdog timer expiry, it exit to QEMU
if TCR.WRC is non ZERO. QEMU can reset/shutdown etc depending upon how
it is configured.

Signed-off-by: Liu Yu <yu.liu@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
[bharat.bhushan@freescale.com: reworked patch]
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
[agraf: adjust to new request framework]
Signed-off-by: Alexander Graf <agraf@suse.de>


# 7c973a2e 12-Aug-2012 Alexander Graf <agraf@suse.de>

KVM: PPC: Add return value to core_check_requests

Requests may want to tell us that we need to go back into host state,
so add a return value for the checks.

Signed-off-by: Alexander Graf <agraf@suse.de>


# bd2be683 12-Aug-2012 Alexander Graf <agraf@suse.de>

KVM: PPC: Book3S: PR: Rework irq disabling

Today, we disable preemption while inside guest context, because we need
to expose to the world that we are not in a preemptible context. However,
during that time we already have interrupts disabled, which would indicate
that we are in a non-preemptible context.

The reason the checks for irqs_disabled() fail for us though is that we
manually control hard IRQs and ignore all the lazy EE framework. Let's
stop doing that. Instead, let's always use lazy EE to indicate when we
want to disable IRQs, but do a special final switch that gets us into
EE disabled, but soft enabled state. That way when we get back out of
guest state, we are immediately ready to process interrupts.

This simplifies the code drastically and reduces the time that we appear
as preempt disabled.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 03d25c5b 09-Aug-2012 Alexander Graf <agraf@suse.de>

KVM: PPC: Use same kvmppc_prepare_to_enter code for booke and book3s_pr

We need to do the same things when preparing to enter a guest for booke and
book3s_pr cores. Fold the generic code into a generic function that both call.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 862d31f7 30-Jul-2012 Alexander Graf <agraf@suse.de>

KVM: PPC: E500: Implement MMU notifiers

The e500 target has lived without mmu notifiers ever since it got
introduced, but fails for the user space check on them with hugetlbfs.

So in order to get that one working, implement mmu notifiers in a
reasonably dumb fashion and be happy. On embedded hardware, we almost
never end up with mmu notifier calls, since most people don't overcommit.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 249ba1ee 03-Aug-2012 Alexander Graf <agraf@suse.de>

KVM: PPC: Add cache flush on page map

When we map a page that wasn't icache cleared before, do so when first
mapping it in KVM using the same information bits as the Linux mapping
logic. That way we are 100% sure that any page we map does not have stale
entries in the icache.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 32fad281 03-May-2012 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S HV: Make the guest hash table size configurable

This adds a new ioctl to enable userspace to control the size of the guest
hashed page table (HPT) and to clear it out when resetting the guest.
The KVM_PPC_ALLOCATE_HTAB ioctl is a VM ioctl and takes as its parameter
a pointer to a u32 containing the desired order of the HPT (log base 2
of the size in bytes), which is updated on successful return to the
actual order of the HPT which was allocated.

There must be no vcpus running at the time of this ioctl. To enforce
this, we now keep a count of the number of vcpus running in
kvm->arch.vcpus_running.

If the ioctl is called when a HPT has already been allocated, we don't
reallocate the HPT but just clear it out. We first clear the
kvm->arch.rma_setup_done flag, which has two effects: (a) since we hold
the kvm->lock mutex, it will prevent any vcpus from starting to run until
we're done, and (b) it means that the first vcpu to run after we're done
will re-establish the VRMA if necessary.

If userspace doesn't call this ioctl before running the first vcpu, the
kernel will allocate a default-sized HPT at that point. We do it then
rather than when creating the VM, as the code did previously, so that
userspace has a chance to do the ioctl if it wants.

When allocating the HPT, we can allocate either from the kernel page
allocator, or from the preallocated pool. If userspace is asking for
a different size from the preallocated HPTs, we first try to allocate
using the kernel page allocator. Then we try to allocate from the
preallocated pool, and then if that fails, we try allocating decreasing
sizes from the kernel page allocator, down to the minimum size allowed
(256kB). Note that the kernel page allocator limits allocations to
1 << CONFIG_FORCE_MAX_ZONEORDER pages, which by default corresponds to
16MB (on 64-bit powerpc, at least).

Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix module compilation]
Signed-off-by: Alexander Graf <agraf@suse.de>


# 54771e62 04-May-2012 Alexander Graf <agraf@suse.de>

KVM: PPC: Emulator: clean up SPR reads and writes

When reading and writing SPRs, every SPR emulation piece had to read
or write the respective GPR the value was read from or stored in itself.

This approach is pretty prone to failure. What if we accidentally
implement mfspr emulation where we just do "break" and nothing else?
Suddenly we would get a random value in the return register - which is
always a bad idea.

So let's consolidate the generic code paths and only give the core
specific SPR handling code readily made variables to read/write from/to.

Functionally, this patch doesn't change anything, but it increases the
readability of the code and makes is less prone to bugs.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 5b74716e 26-Apr-2012 Benjamin Herrenschmidt <benh@kernel.crashing.org>

kvm/powerpc: Add new ioctl to retreive server MMU infos

This is necessary for qemu to be able to pass the right information
to the guest, such as the supported page sizes and corresponding
encodings in the SLB and hash table, which can vary depending
on the processor type, the type of KVM used (PR vs HV) and the
version of KVM

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[agraf: fix compilation on hv, adjust for newer ioctl numbers]
Signed-off-by: Alexander Graf <agraf@suse.de>


# f31e65e1 15-Mar-2012 Benjamin Herrenschmidt <benh@kernel.crashing.org>

kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM

There is nothing in the code for emulating TCE tables in the kernel
that prevents it from working on "PR" KVM... other than ifdef's and
location of the code.

This and moves the bulk of the code there to a new file called
book3s_64_vio.c.

This speeds things up a bit on my G5.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[agraf: fix for hv kvm, 32bit, whitespace]
Signed-off-by: Alexander Graf <agraf@suse.de>


# a8e4ef84 16-Feb-2012 Alexander Graf <agraf@suse.de>

KVM: PPC: booke: rework rescheduling checks

Instead of checking whether we should reschedule only when we exited
due to an interrupt, let's always check before entering the guest back
again. This gets the target more in line with the other archs.

Also while at it, generalize the whole thing so that eventually we could
have a single kvmppc_prepare_to_enter function for all ppc targets that
does signal and reschedule checking for us.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# d30f6e48 20-Dec-2011 Scott Wood <scottwood@freescale.com>

KVM: PPC: booke: category E.HV (GS-mode) support

Chips such as e500mc that implement category E.HV in Power ISA 2.06
provide hardware virtualization features, including a new MSR mode for
guest state. The guest OS can perform many operations without trapping
into the hypervisor, including transitions to and from guest userspace.

Since we can use SRR1[GS] to reliably tell whether an exception came from
guest state, instead of messing around with IVPR, we use DO_KVM similarly
to book3s.

Current issues include:
- Machine checks from guest state are not routed to the host handler.
- The guest can cause a host oops by executing an emulated instruction
in a page that lacks read permission. Existing e500/4xx support has
the same problem.

Includes work by Ashish Kalra <Ashish.Kalra@freescale.com>,
Varun Sethi <Varun.Sethi@freescale.com>, and
Liu Yu <yu.liu@freescale.com>.

Signed-off-by: Scott Wood <scottwood@freescale.com>
[agraf: remove pt_regs usage]
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 043cc4d7 20-Dec-2011 Scott Wood <scottwood@freescale.com>

KVM: PPC: factor out lpid allocator from book3s_64_mmu_hv

We'll use it on e500mc as well.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# d2a1b483 16-Jan-2012 Alexander Graf <agraf@suse.de>

KVM: PPC: Add HPT preallocator

We're currently allocating 16MB of linear memory on demand when creating
a guest. That does work some times, but finding 16MB of linear memory
available in the system at runtime is definitely not a given.

So let's add another command line option similar to the RMA preallocator,
that we can use to keep a pool of page tables around. Now, when a guest
gets created it has a pretty low chance of receiving an OOM.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# b4e70611 16-Jan-2012 Alexander Graf <agraf@suse.de>

KVM: PPC: Convert RMA allocation into generic code

We have code to allocate big chunks of linear memory on bootup for later use.
This code is currently used for RMA allocation, but can be useful beyond that
extent.

Make it generic so we can reuse it for other stuff later.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 31f3438e 11-Dec-2011 Paul Mackerras <paulus@samba.org>

KVM: PPC: Move kvm_vcpu_ioctl_[gs]et_one_reg down to platform-specific code

This moves the get/set_one_reg implementation down from powerpc.c into
booke.c, book3s_pr.c and book3s_hv.c. This avoids #ifdefs in C code,
but more importantly, it fixes a bug on Book3s HV where we were
accessing beyond the end of the kvm_vcpu struct (via the to_book3s()
macro) and corrupting memory, causing random crashes and file corruption.

On Book3s HV we only accept setting the HIOR to zero, since the guest
runs in supervisor mode and its vectors are never offset from zero.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
[agraf update to apply on top of changed ONE_REG patches]
Signed-off-by: Avi Kivity <avi@redhat.com>


# da9d1d7f 11-Dec-2011 Paul Mackerras <paulus@samba.org>

KVM: PPC: Allow use of small pages to back Book3S HV guests

This relaxes the requirement that the guest memory be provided as
16MB huge pages, allowing it to be provided as normal memory, i.e.
in pages of PAGE_SIZE bytes (4k or 64k). To allow this, we index
the kvm->arch.slot_phys[] arrays with a small page index, even if
huge pages are being used, and use the low-order 5 bits of each
entry to store the order of the enclosing page with respect to
normal pages, i.e. log_2(enclosing_page_size / PAGE_SIZE).

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# c77162de 11-Dec-2011 Paul Mackerras <paulus@samba.org>

KVM: PPC: Only get pages when actually needed, not in prepare_memory_region()

This removes the code from kvmppc_core_prepare_memory_region() that
looked up the VMA for the region being added and called hva_to_page
to get the pfns for the memory. We have no guarantee that there will
be anything mapped there at the time of the KVM_SET_USER_MEMORY_REGION
ioctl call; userspace can do that ioctl and then map memory into the
region later.

Instead we defer looking up the pfn for each memory page until it is
needed, which generally means when the guest does an H_ENTER hcall on
the page. Since we can't call get_user_pages in real mode, if we don't
already have the pfn for the page, kvmppc_h_enter() will return
H_TOO_HARD and we then call kvmppc_virtmode_h_enter() once we get back
to kernel context. That calls kvmppc_get_guest_page() to get the pfn
for the page, and then calls back to kvmppc_h_enter() to redo the HPTE
insertion.

When the first vcpu starts executing, we need to have the RMO or VRMA
region mapped so that the guest's real mode accesses will work. Thus
we now have a check in kvmppc_vcpu_run() to see if the RMO/VRMA is set
up and if not, call kvmppc_hv_setup_rma(). It checks if the memslot
starting at guest physical 0 now has RMO memory mapped there; if so it
sets it up for the guest, otherwise on POWER7 it sets up the VRMA.
The function that does that, kvmppc_map_vrma, is now a bit simpler,
as it calls kvmppc_virtmode_h_enter instead of creating the HPTE itself.

Since we are now potentially updating entries in the slot_phys[]
arrays from multiple vcpu threads, we now have a spinlock protecting
those updates to ensure that we don't lose track of any references
to pages.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# dfd4d47e 16-Nov-2011 Scott Wood <scottwood@freescale.com>

KVM: PPC: booke: Improve timer register emulation

Decrementers are now properly driven by TCR/TSR, and the guest
has full read/write access to these registers.

The decrementer keeps ticking (and setting the TSR bit) regardless of
whether the interrupts are enabled with TCR.

The decrementer stops at zero, rather than going negative.

Decrementers (and FITs, once implemented) are delivered as
level-triggered interrupts -- dequeued when the TSR bit is cleared, not
on delivery.

Signed-off-by: Liu Yu <yu.liu@freescale.com>
[scottwood@freescale.com: significant changes]
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 7e28e60e 08-Nov-2011 Scott Wood <scottwood@freescale.com>

KVM: PPC: Rename deliver_interrupts to prepare_to_enter

This function also updates paravirt int_pending, so rename it
to be more obvious that this is a collection of checks run prior
to (re)entering a guest.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# dc83b8bc 18-Aug-2011 Scott Wood <scottwood@freescale.com>

KVM: PPC: e500: MMU API

This implements a shared-memory API for giving host userspace access to
the guest's TLB.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# af8f38b3 10-Aug-2011 Alexander Graf <agraf@suse.de>

KVM: PPC: Add sanity checking to vcpu_run

There are multiple features in PowerPC KVM that can now be enabled
depending on the user's wishes. Some of the combinations don't make
sense or don't work though.

So this patch adds a way to check if the executing environment would
actually be able to run the guest properly. It also adds sanity
checks if PVR is set (should always be true given the current code
flow), if PAPR is only used with book3s_64 where it works and that
HV KVM is only used in PAPR mode.

Signed-off-by: Alexander Graf <agraf@suse.de>


# aa04b4cc 28-Jun-2011 Paul Mackerras <paulus@samba.org>

KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests

This adds infrastructure which will be needed to allow book3s_hv KVM to
run on older POWER processors, including PPC970, which don't support
the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
Offset (RMO) facility. These processors require a physically
contiguous, aligned area of memory for each guest. When the guest does
an access in real mode (MMU off), the address is compared against a
limit value, and if it is lower, the address is ORed with an offset
value (from the Real Mode Offset Register (RMOR)) and the result becomes
the real address for the access. The size of the RMA has to be one of
a set of supported values, which usually includes 64MB, 128MB, 256MB
and some larger powers of 2.

Since we are unlikely to be able to allocate 64MB or more of physically
contiguous memory after the kernel has been running for a while, we
allocate a pool of RMAs at boot time using the bootmem allocator. The
size and number of the RMAs can be set using the kvm_rma_size=xx and
kvm_rma_count=xx kernel command line options.

KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
of the pool of preallocated RMAs. The capability value is 1 if the
processor can use an RMA but doesn't require one (because it supports
the VRMA facility), or 2 if the processor requires an RMA for each guest.

This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
pool and returns a file descriptor which can be used to map the RMA. It
also returns the size of the RMA in the argument structure.

Having an RMA means we will get multiple KMV_SET_USER_MEMORY_REGION
ioctl calls from userspace. To cope with this, we now preallocate the
kvm->arch.ram_pginfo array when the VM is created with a size sufficient
for up to 64GB of guest memory. Subsequently we will get rid of this
array and use memory associated with each memslot instead.

This moves most of the code that translates the user addresses into
host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
to kvmppc_core_prepare_memory_region. Also, instead of having to look
up the VMA for each page in order to check the page size, we now check
that the pages we get are compound pages of 16MB. However, if we are
adding memory that is mapped to an RMA, we don't bother with calling
get_user_pages_fast and instead just offset from the base pfn for the
RMA.

Typically the RMA gets added after vcpus are created, which makes it
inconvenient to have the LPCR (logical partition control register) value
in the vcpu->arch struct, since the LPCR controls whether the processor
uses RMA or VRMA for the guest. This moves the LPCR value into the
kvm->arch struct and arranges for the MER (mediated external request)
bit, which is the only bit that varies between vcpus, to be set in
assembly code when going into the guest if there is a pending external
interrupt request.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 371fefd6 28-Jun-2011 Paul Mackerras <paulus@samba.org>

KVM: PPC: Allow book3s_hv guests to use SMT processor modes

This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.

This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.

To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.

When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.

We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.

When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).

It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
the none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.

Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 54738c09 28-Jun-2011 David Gibson <dwg@au1.ibm.com>

KVM: PPC: Accelerate H_PUT_TCE by implementing it in real mode

This improves I/O performance for guests using the PAPR
paravirtualization interface by making the H_PUT_TCE hcall faster, by
implementing it in real mode. H_PUT_TCE is used for updating virtual
IOMMU tables, and is used both for virtual I/O and for real I/O in the
PAPR interface.

Since this moves the IOMMU tables into the kernel, we define a new
KVM_CREATE_SPAPR_TCE ioctl to allow qemu to create the tables. The
ioctl returns a file descriptor which can be used to mmap the newly
created table. The qemu driver models use them in the same way as
userspace managed tables, but they can be updated directly by the
guest with a real-mode H_PUT_TCE implementation, reducing the number
of host/guest context switches during guest IO.

There are certain circumstances where it is useful for userland qemu
to write to the TCE table even if the kernel H_PUT_TCE path is used
most of the time. Specifically, allowing this will avoid awkwardness
when we need to reset the table. More importantly, we will in the
future need to write the table in order to restore its state after a
checkpoint resume or migration.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# a8606e20 28-Jun-2011 Paul Mackerras <paulus@samba.org>

KVM: PPC: Handle some PAPR hcalls in the kernel

This adds the infrastructure for handling PAPR hcalls in the kernel,
either early in the guest exit path while we are still in real mode,
or later once the MMU has been turned back on and we are in the full
kernel context. The advantage of handling hcalls in real mode if
possible is that we avoid two partition switches -- and this will
become more important when we support SMT4 guests, since a partition
switch means we have to pull all of the threads in the core out of
the guest. The disadvantage is that we can only access the kernel
linear mapping, not anything vmalloced or ioremapped, since the MMU
is off.

This also adds code to handle the following hcalls in real mode:

H_ENTER Add an HPTE to the hashed page table
H_REMOVE Remove an HPTE from the hashed page table
H_READ Read HPTEs from the hashed page table
H_PROTECT Change the protection bits in an HPTE
H_BULK_REMOVE Remove up to 4 HPTEs from the hashed page table
H_SET_DABR Set the data address breakpoint register

Plus code to handle the following hcalls in the kernel:

H_CEDE Idle the vcpu until an interrupt or H_PROD hcall arrives
H_PROD Wake up a ceded vcpu
H_REGISTER_VPA Register a virtual processor area (VPA)

The code that runs in real mode has to be in the base kernel, not in
the module, if KVM is compiled as a module. The real-mode code can
only access the kernel linear mapping, not vmalloc or ioremap space.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# de56a948 28-Jun-2011 Paul Mackerras <paulus@samba.org>

KVM: PPC: Add support for Book3S processors in hypervisor mode

This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.

This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.

Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.

This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.

With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.

We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.

In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.

The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.

This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.

This also adds a few new exports needed by the book3s_hv code.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# df6909e5 28-Jun-2011 Paul Mackerras <paulus@samba.org>

KVM: PPC: Move guest enter/exit down into subarch-specific code

Instead of doing the kvm_guest_enter/exit() and local_irq_dis/enable()
calls in powerpc.c, this moves them down into the subarch-specific
book3s_pr.c and booke.c. This eliminates an extra local_irq_enable()
call in book3s_pr.c, and will be needed for when we do SMT4 guest
support in the book3s hypervisor mode code.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# f9e0554d 28-Jun-2011 Paul Mackerras <paulus@samba.org>

KVM: PPC: Pass init/destroy vm and prepare/commit memory region ops down

This arranges for the top-level arch/powerpc/kvm/powerpc.c file to
pass down some of the calls it gets to the lower-level subarchitecture
specific code. The lower-level implementations (in booke.c and book3s.c)
are no-ops. The coming book3s_hv.c will need this.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# a4cd8b23 14-Jun-2011 Scott Wood <scottwood@freescale.com>

KVM: PPC: e500: enable magic page

This is a shared page used for paravirtualization. It is always present
in the guest kernel's effective address space at the address indicated
by the hypercall that enables it.

The physical address specified by the hypercall is not used, as
e500 does not have real mode.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 5ce941ee 27-Apr-2011 Scott Wood <scottwood@freescale.com>

KVM: PPC: booke: add sregs support

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 2a342ed5 29-Jul-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Implement hypervisor interface

To communicate with KVM directly we need to plumb some sort of interface
between the guest and KVM. Usually those interfaces use hypercalls.

This hypercall implementation is described in the last patch of the series
in a special documentation file. Please read that for further information.

This patch implements stubs to handle KVM PPC hypercalls on the host and
guest side alike.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 9cc5e953 15-Apr-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Extract MMU init

The host shadow mmu code needs to get initialized. It needs to fetch a
segment it can use to put shadow PTEs into.

That initialization code was in generic code, which is icky. Let's move
it over to the respective MMU file.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# c7f38f46 15-Apr-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Improve indirect svcpu accessors

We already have some inline fuctions we use to access vcpu or svcpu structs,
depending on whether we're on booke or book3s. Since we just put a few more
registers into the svcpu, we also need to make sure the respective callbacks
are available and get used.

So this patch moves direct use of the now in the svcpu struct fields to
inline function calls. While at it, it also moves the definition of those
inline function calls to respective header files for booke and book3s,
greatly improving readability.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 18978768 24-Mar-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Allow userspace to unset the IRQ line

Userspace can tell us that it wants to trigger an interrupt. But
so far it can't tell us that it wants to stop triggering one.

So let's interpret the parameter to the ioctl that we have anyways
to tell us if we want to raise or lower the interrupt line.

Signed-off-by: Alexander Graf <agraf@suse.de>

v2 -> v3:

- Add CAP for unset irq
Signed-off-by: Avi Kivity <avi@redhat.com>


# 0564ee8a 19-Feb-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Add helpers to modify ppc fields

The PowerPC specification always lists bits from MSB to LSB. That is
really confusing when you're trying to write C code, because it fits
in pretty badly with the normal (1 << xx) schemes.

So I came up with some nice wrappers that allow to get and set fields
in a u64 with bit numbers exactly as given in the spec. That makes the
code in KVM and the spec easier comparable.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 37f5bca6 19-Feb-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Add AGAIN type for emulation return

Emulation of an instruction can have different outcomes. It can succeed,
fail, require MMIO, do funky BookE stuff - or it can just realize something's
odd and will be fixed the next time around.

Exactly that is what EMULATE_AGAIN means. Using that flag we can now tell
the caller that nothing happened, but we still want to go back to the
guest and see what happens next time we come around.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 3587d534 19-Feb-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Teach MMIO Signedness

The guest I was trying to get to run uses the LHA and LHAU instructions.
Those instructions basically do a load, but also sign extend the result.

Since we need to fill our registers by hand when doing MMIO, we also need
to sign extend manually.

This patch implements sign extended MMIO and the LHA(U) instructions.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# b104d066 19-Feb-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Enable MMIO to do 64 bits, fprs and qprs

Right now MMIO access can only happen for GPRs and is at most 32 bit wide.
That's actually enough for almost all types of hardware out there.

Unfortunately, the guest I was using used FPU writes to MMIO regions, so
it ended up writing 64 bit MMIOs using FPRs and QPRs.

So let's add code to handle those odd cases too.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 1c0006d8 15-Jan-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Fix initial GPR settings

Commit 7d01b4c3ed2bb33ceaf2d270cb4831a67a76b51b introduced PACA backed vcpu
values. With this patch, when a userspace app was setting GPRs before it was
actually first loaded, the set values get discarded.

This is because vcpu_load loads them from the vcpu backing store that we use
whenever we're not owning the PACA.

That behavior is not really a major problem, because we don't need it for
qemu. Other users (like kvmctl) do have problems with it though, so let's
better do it right.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 25a8a02d 07-Jan-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Emulate trap SRR1 flags properly

Book3S needs some flags in SRR1 to get to know details about an interrupt.

One such example is the trap instruction. It tells the guest kernel that
a program interrupt is due to a trap using a bit in SRR1.

This patch implements above behavior, making WARN_ON behave like WARN_ON.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 7e57cba0 07-Jan-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Use PACA backed shadow vcpu

We're being horribly racy right now. All the entry and exit code hijacks
random fields from the PACA that could easily be used by different code in
case we get interrupted, for example by a #MC or even page fault.

After discussing this with Ben, we figured it's best to reserve some more
space in the PACA and just shove off some vcpu state to there.

That way we can drastically improve the readability of the code, make it
less racy and less complex.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 992b5b29 07-Jan-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Add helpers for CR, XER

We now have helpers for the GPRs, so let's also add some for CR and XER.

Having them in the PACA simplifies code a lot, as we don't need to care
about where to store CC or not to overflow any integers.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 8e5b26b5 07-Jan-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Use accessor functions for GPR access

All code in PPC KVM currently accesses gprs in the vcpu struct directly.

While there's nothing wrong with that wrt the current way gprs are stored
and loaded, it doesn't suffice for the PACA acceleration that will follow
in this patchset.

So let's just create little wrapper inline functions that we call whenever
a GPR needs to be read from or written to. The compiled code shouldn't really
change at all for now.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 7706664d 21-Dec-2009 Alexander Graf <agraf@suse.de>

KVM: powerpc: Improve DEC handling

We treated the DEC interrupt like an edge based one. This is not true for
Book3s. The DEC keeps firing until mtdec is issued again and thus clears
the interrupt line.

So let's implement this logic in KVM too. This patch moves the line clearing
from the firing of the interrupt to the mtdec emulation.

This makes PPC64 guests work without AGGRESSIVE_DEC defined.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Acked-by: Hollis Blanchard <hollis@penguinppc.org>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 29eb61bc 29-Oct-2009 Alexander Graf <agraf@suse.de>

Add book3s_64 highmem asm code

This is the of entry / exit code. In order to switch between host and guest
context, we need to switch register state and call the exit code handler on
exit.

This assembly file does exactly that. To finally enter the guest it calls
into book3s_64_slb.S. On exit it gets jumped at from book3s_64_slb.S too.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# f5d0906b 04-Jan-2009 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: remove debug support broken by KVM debug rewrite

After the rewrite of KVM's debug support, this code doesn't even build any
more.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# b52a638c 03-Jan-2009 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: Add kvmppc_mmu_dtlb/itlb_miss for booke

When itlb or dtlb miss happens, E500 needs to update some mmu registers.
So that the auto-load mechanism can work on E500 when write a tlb entry.

Signed-off-by: Liu Yu <yu.liu@freescale.com>
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# fa86b8dd 03-Jan-2009 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: rename 44x MMU functions used in booke.c

e500 will provide its own implementation of these.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# be8d1cae 03-Jan-2009 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: turn tlb_xlate() into a per-core hook (and give it a better name)

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 58a96214 03-Jan-2009 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: change kvmppc_mmu_map() parameters

Passing just the TLB index will ease an e500 implementation.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# ecc0981f 03-Jan-2009 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: cosmetic changes to mmu hook names

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 7924bd41 02-Dec-2008 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: directly insert shadow mappings into the hardware TLB

Formerly, we used to maintain a per-vcpu shadow TLB and on every entry to the
guest would load this array into the hardware TLB. This consumed 1280 bytes of
memory (64 entries of 16 bytes plus a struct page pointer each), and also
required some assembly to loop over the array on every entry.

Instead of saving a copy in memory, we can just store shadow mappings directly
into the hardware TLB, accepting that the host kernel will clobber these as
part of the normal 440 TLB round robin. When we do that we need less than half
the memory, and we have decreased the exit handling time for all guest exits,
at the cost of increased number of TLB misses because the host overwrites some
guest entries.

These savings will be increased on processors with larger TLBs or which
implement intelligent flush instructions like tlbivax (which will avoid the
need to walk arrays in software).

In addition to that and to the code simplification, we have a greater chance of
leaving other host userspace mappings in the TLB, instead of forcing all
subsequent tasks to re-fault all their mappings.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 89168618 02-Dec-2008 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: support large host pages

KVM on 440 has always been able to handle large guest mappings with 4K host
pages -- we must, since the guest kernel uses 256MB mappings.

This patch makes KVM work when the host has large pages too (tested with 64K).

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# d4cf3892 05-Nov-2008 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: optimize irq delivery path

In kvmppc_deliver_interrupt is just one case left in the switch and it is a
rare one (less than 8%) when looking at the exit numbers. Therefore we can
at least drop the switch/case and if an if. I inserted an unlikely too, but
that's open for discussion.

In kvmppc_can_deliver_interrupt all frequent cases are in the default case.
I know compilers are smart but we can make it easier for them. By writing
down all options and removing the default case combined with the fact that
ithe values are constants 0..15 should allow the compiler to write an easy
jump table.
Modifying kvmppc_can_deliver_interrupt pointed me to the fact that gcc seems
to be unable to reduce priority_exception[x] to a build time constant.
Therefore I changed the usage of the translation arrays in the interrupt
delivery path completely. It is now using priority without translation to irq
on the full irq delivery path.
To be able to do that ivpr regs are stored by their priority now.

Additionally the decision made in kvmppc_can_deliver_interrupt is already
sufficient to get the value of interrupt_msr_mask[x]. Therefore we can replace
the 16x4byte array used here with a single 4byte variable (might still be one
miss, but the chance to find this in cache should be better than the right
entry of the whole array).

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# db93f574 05-Nov-2008 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: create struct kvm_vcpu_44x and introduce container_of() accessor

This patch doesn't yet move all 44x-specific data into the new structure, but
is the first step down that path. In the future we may also want to create a
struct kvm_vcpu_booke.

Based on patch from Liu Yu <yu.liu@freescale.com>.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 5cbb5106 05-Nov-2008 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: Move the last bits of 44x code out of booke.c

Needed to port to other Book E processors.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 75f74f0d 05-Nov-2008 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: refactor instruction emulation into generic and core-specific pieces

Cores provide 3 emulation hooks, implemented for example in the new
4xx_emulate.c:
kvmppc_core_emulate_op
kvmppc_core_emulate_mtspr
kvmppc_core_emulate_mfspr

Strictly speaking the last two aren't necessary, but provide for more
informative error reporting ("unknown SPR").

Long term I'd like to have instruction decoding autogenerated from tables of
opcodes, and that way we could aggregate universal, Book E, and core-specific
instructions more easily and without redundant switch statements.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 9dd921cf 05-Nov-2008 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: Refactor powerpc.c to relocate 440-specific code

This introduces a set of core-provided hooks. For 440, some of these are
implemented by booke.c, with the rest in (the new) 44x.c.

Note that these hooks are link-time, not run-time. Since it is not possible to
build a single kernel for both e500 and 440 (for example), using function
pointers would only add overhead.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 0f55dc48 05-Nov-2008 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: Rename "struct tlbe" to "struct kvmppc_44x_tlbe"

This will ease ports to other cores.

Also remove unused "struct kvm_tlb" while we're at it.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# a0d7b9f2 05-Nov-2008 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: Move 440-specific TLB code into 44x_tlb.c

This will make it easier to provide implementations for other cores.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# c30f8a6c 24-Nov-2008 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: stop leaking host memory on VM exit

When the VM exits, we must call put_page() for every page referenced in the
shadow TLB.

Without this patch, we usually leak 30-50 host pages (120 - 200 KiB with 4 KiB
pages). The maximum number of pages leaked is the size of our shadow TLB, 64
pages.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 49dd2c49 25-Jul-2008 Hollis Blanchard <hollisb@us.ibm.com>

KVM: powerpc: Map guest userspace with TID=0 mappings

When we use TID=N userspace mappings, we must ensure that kernel mappings have
been destroyed when entering userspace. Using TID=1/TID=0 for kernel/user
mappings and running userspace with PID=0 means that userspace can't access the
kernel mappings, but the kernel can directly access userspace.

The net is that we don't need to flush the TLB on privilege switches, but we do
on guest context switches (which are far more infrequent). Guest boot time
performance improvement: about 30%.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>


# 83aae4a8 25-Jul-2008 Hollis Blanchard <hollisb@us.ibm.com>

KVM: ppc: Write only modified shadow entries into the TLB on exit

Track which TLB entries need to be written, instead of overwriting everything
below the high water mark. Typically only a single guest TLB entry will be
modified in a single exit.

Guest boot time performance improvement: about 15%.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>


# b8b572e1 31-Jul-2008 Stephen Rothwell <sfr@canb.auug.org.au>

powerpc: Move include files to arch/powerpc/include/asm

from include/asm-powerpc. This is the result of a

mkdir arch/powerpc/include/asm
git mv include/asm-powerpc/* arch/powerpc/include/asm

Followed by a few documentation/comment fixups and a couple of places
where <asm-powepc/...> was being used explicitly. Of the latter only
one was outside the arch code and it is a driver only built for powerpc.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>