History log of /linux-master/arch/powerpc/kvm/book3s.c
Revision Date Author Comments
# df938a55 01-Dec-2023 Jordan Niethe <jniethe5@gmail.com>

KVM: PPC: Book3S HV nestedv2: Do not inject certain interrupts

There is no need to inject an external interrupt in
kvmppc_book3s_irqprio_deliver() as the test for BOOK3S_IRQPRIO_EXTERNAL
in kvmhv_run_single_vcpu() before guest entry will raise LPCR_MER if
needed. There is also no need to inject the decrementer interrupt as
this will be raised within the L2 if needed. Avoiding these injections
reduces H_GUEST_GET_STATE hcalls by the L1.

Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231201132618.555031-8-vaibhav@linux.ibm.com


# 7028ac8d 13-Sep-2023 Jordan Niethe <jniethe5@gmail.com>

KVM: PPC: Use accessors for VCPU registers

Introduce accessor generator macros for VCPU registers. Use the accessor
functions to replace direct accesses to this registers.

This will be important later for Nested APIv2 support which requires
additional functionality for accessing and modifying VCPU state.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230914030600.16993-5-jniethe5@gmail.com


# 52425a3b 13-Sep-2023 Jordan Niethe <jniethe5@gmail.com>

KVM: PPC: Introduce FPR/VR accessor functions

Introduce accessor functions for floating point and vector registers
like the ones that exist for GPRs. Use these to replace the existing FPR
and VR accessor macros.

This will be important later for Nested APIv2 support which requires
additional functionality for accessing and modifying VCPU state.

Signed-off-by: Gautam Menghani <gautam@linux.ibm.com>
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230914030600.16993-3-jniethe5@gmail.com


# 953e3739 07-Mar-2023 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Fetch prefixed instructions from the guest

In order to handle emulation of prefixed instructions in the guest,
this first makes vcpu->arch.last_inst be an unsigned long, i.e. 64
bits on 64-bit platforms. For prefixed instructions, the upper 32
bits are used for the prefix and the lower 32 bits for the suffix, and
both halves are byte-swapped if the guest endianness differs from the
host.

Next, vcpu->arch.emul_inst is now 64 bits wide, to match the HEIR
register on POWER10. Like HEIR, for a prefixed instruction it is
defined to have the prefix is in the top 32 bits and the suffix in the
bottom 32 bits, with both halves in the correct byte order.

kvmppc_get_last_inst is extended on 64-bit machines to put the prefix
and suffix in the right places in the ppc_inst_t being returned.

kvmppc_load_last_inst now returns the instruction in an unsigned long
in the same format as vcpu->arch.last_inst. It makes the decision
about whether to fetch a suffix based on the SRR1_PREFIXED bit in the
MSR image stored in the vcpu struct, which generally comes from SRR1
or HSRR1 on an interrupt. This bit is defined in Power ISA v3.1B to
be set if the interrupt occurred due to a prefixed instruction and
cleared otherwise for all interrupts except for instruction storage
interrupt, which does not come to the hypervisor. It is set to zero
for asynchronous interrupts such as external interrupts. In previous
ISA versions it was always set to 0 for all interrupts except
instruction storage interrupt.

The code in book3s_hv_rmhandlers.S that loads the faulting instruction
on a HDSI is only used on POWER8 and therefore doesn't ever need to
load a suffix.

[npiggin@gmail.com - check that the is-prefixed bit in SRR1 matches the
type of instruction that was fetched.]

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/ZAgsq9h1CCzouQuV@cleo


# 460ba21d 30-Mar-2023 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Permit SRR1 flags in more injected interrupt types

The prefix architecture in ISA v3.1 introduces a prefixed bit in SRR1
for many types of synchronous interrupts which is set when the interrupt
is caused by a prefixed instruction.

This requires KVM to be able to set this bit when injecting interrupts
into a guest. Plumb through the SRR1 "flags" argument to the core_queue
APIs where it's missing for this. For now they are set to 0, which is
no change.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fixup kvmppc_core_queue_alignment() in booke.c]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230330103224.3589928-2-npiggin@gmail.com


# 81a1cf9f 30-Nov-2022 Sean Christopherson <seanjc@google.com>

KVM: Drop kvm_arch_check_processor_compat() hook

Drop kvm_arch_check_processor_compat() and its support code now that all
architecture implementations are nops.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390
Acked-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20221130230934.1014142-33-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# ae19b15d 30-Nov-2022 Sean Christopherson <seanjc@google.com>

KVM: PPC: Move processor compatibility check to module init

Move KVM PPC's compatibility checks to their respective module_init()
hooks, there's no need to wait until KVM's common compat check, nor is
there a need to perform the check on every CPU (provided by common KVM's
hook), as the compatibility checks operate on global data.

arch/powerpc/include/asm/cputable.h: extern struct cpu_spec *cur_cpu_spec;
arch/powerpc/kvm/book3s.c: return 0
arch/powerpc/kvm/e500.c: strcmp(cur_cpu_spec->cpu_name, "e500v2")
arch/powerpc/kvm/e500mc.c: strcmp(cur_cpu_spec->cpu_name, "e500mc")
strcmp(cur_cpu_spec->cpu_name, "e5500")
strcmp(cur_cpu_spec->cpu_name, "e6500")

Cc: Fabiano Rosas <farosas@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Message-Id: <20221130230934.1014142-27-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# eaaaed13 06-Dec-2021 Sean Christopherson <seanjc@google.com>

KVM: PPC: Avoid referencing userspace memory region in memslot updates

For PPC HV, get the number of pages directly from the new memslot instead
of computing the same from the userspace memory region, and explicitly
check for !DELETE instead of inferring the same when toggling mmio_update.
The motivation for these changes is to avoid referencing the @mem param
so that it can be dropped in a future commit.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <1e97fb5198be25f98ef82e63a8d770c682264cc9.1638817639.git.maciej.szmigiero@oracle.com>


# 537a17b3 06-Dec-2021 Sean Christopherson <seanjc@google.com>

KVM: Let/force architectures to deal with arch specific memslot data

Pass the "old" slot to kvm_arch_prepare_memory_region() and force arch
code to handle propagating arch specific data from "new" to "old" when
necessary. This is a baby step towards dynamically allocating "new" from
the get go, and is a (very) minor performance boost on x86 due to not
unnecessarily copying arch data.

For PPC HV, copy the rmap in the !CREATE and !DELETE paths, i.e. for MOVE
and FLAGS_ONLY. This is functionally a nop as the previous behavior
would overwrite the pointer for CREATE, and eventually discard/ignore it
for DELETE.

For x86, copy the arch data only for FLAGS_ONLY changes. Unlike PPC HV,
x86 needs to reallocate arch data in the MOVE case as the size of x86's
allocations depend on the alignment of the memslot's gfn.

Opportunistically tweak kvm_arch_prepare_memory_region()'s param order to
match the "commit" prototype.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
[mss: add missing RISCV kvm_arch_prepare_memory_region() change]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <67dea5f11bbcfd71e3da5986f11e87f5dd4013f9.1638817639.git.maciej.szmigiero@oracle.com>


# 87bcc5fa 02-Aug-2021 Jing Zhang <jingzhangos@google.com>

KVM: stats: Add halt_wait_ns stats for all architectures

Add simple stats halt_wait_ns to record the time a VCPU has spent on
waiting for all architectures (not just powerpc).

Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210802165633.1866976-5-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# f95937cc 02-Aug-2021 Jing Zhang <jingzhangos@google.com>

KVM: stats: Support linear and logarithmic histogram statistics

Add new types of KVM stats, linear and logarithmic histogram.
Histogram are very useful for observing the value distribution
of time or size related stats.

Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210802165633.1866976-2-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# bc9e9e67 23-Jun-2021 Jing Zhang <jingzhangos@google.com>

KVM: debugfs: Reuse binary stats descriptors

To remove code duplication, use the binary stats descriptors in the
implementation of the debugfs interface for statistics. This unifies
the definition of statistics for the binary and debugfs interfaces.

Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-8-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# ce55c049 18-Jun-2021 Jing Zhang <jingzhangos@google.com>

KVM: stats: Support binary stats retrieval for a VCPU

Add a VCPU ioctl to get a statistics file descriptor by which a read
functionality is provided for userspace to read out VCPU stats header,
descriptors and data.
Define VCPU statistics descriptors and header for all architectures.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com> #arm64
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-5-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# fcfe1bae 18-Jun-2021 Jing Zhang <jingzhangos@google.com>

KVM: stats: Support binary stats retrieval for a VM

Add a VM ioctl to get a statistics file descriptor by which a read
functionality is provided for userspace to read out VM stats header,
descriptors and data.
Define VM statistics descriptors and header for all architectures.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com> #arm64
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-4-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 0193cc90 18-Jun-2021 Jing Zhang <jingzhangos@google.com>

KVM: stats: Separate generic stats from architecture specific ones

Generic KVM stats are those collected in architecture independent code
or those supported by all architectures; put all generic statistics in
a separate structure. This ensures that they are defined the same way
in the statistics API which is being added, removing duplication among
different architectures in the declaration of the descriptors.

No functional change intended.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-2-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# dcbac73a 28-May-2021 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV: Remove virt mode checks from real mode handlers

Now that the P7/8 path no longer supports radix, real-mode handlers
do not need to deal with being called in virt mode.

This change effectively reverts commit acde25726bc6 ("KVM: PPC: Book3S
HV: Add radix checks in real-mode hypercall handlers").

It removes a few more real-mode tests in rm hcall handlers, which
allows the indirect ops for the xive module to be removed from the
built-in xics rm handlers.

kvmppc_h_random is renamed to kvmppc_rm_h_random to be a bit more
descriptive and consistent with other rm handlers.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210528090752.3542186-25-npiggin@gmail.com


# 9dc2babc 28-May-2021 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV P9: Stop handling hcalls in real-mode in the P9 path

In the interest of minimising the amount of code that is run in
"real-mode", don't handle hcalls in real mode in the P9 path. This
requires some new handlers for H_CEDE and xics-on-xive to be added
before xive is pulled or cede logic is checked.

This introduces a change in radix guest behaviour where radix guests
that execute 'sc 1' in userspace now get a privilege fault whereas
previously the 'sc 1' would be reflected as a syscall interrupt to the
guest kernel. That reflection is only required for hash guests that run
PR KVM.

Background:

In POWER8 and earlier processors, it is very expensive to exit from the
HV real mode context of a guest hypervisor interrupt, and switch to host
virtual mode. On those processors, guest->HV interrupts reach the
hypervisor with the MMU off because the MMU is loaded with guest context
(LPCR, SDR1, SLB), and the other threads in the sub-core need to be
pulled out of the guest too. Then the primary must save off guest state,
invalidate SLB and ERAT, and load up host state before the MMU can be
enabled to run in host virtual mode (~= regular Linux mode).

Hash guests also require a lot of hcalls to run due to the nature of the
MMU architecture and paravirtualisation design. The XICS interrupt
controller requires hcalls to run.

So KVM traditionally tries hard to avoid the full exit, by handling
hcalls and other interrupts in real mode as much as possible.

By contrast, POWER9 has independent MMU context per-thread, and in radix
mode the hypervisor is in host virtual memory mode when the HV interrupt
is taken. Radix guests do not require significant hcalls to manage their
translations, and xive guests don't need hcalls to handle interrupts. So
it's much less important for performance to handle hcalls in real mode on
POWER9.

One caveat is that the TCE hcalls are performance critical, real-mode
variants introduced for POWER8 in order to achieve 10GbE performance.
Real mode TCE hcalls were found to be less important on POWER9, which
was able to drive 40GBe networking without them (using the virt mode
hcalls) but performance is still important. These hcalls will benefit
from subsequent guest entry/exit optimisation including possibly a
faster "partial exit" that does not entirely switch to host context to
handle the hcall.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210528090752.3542186-14-npiggin@gmail.com


# b1c5356e 01-Apr-2021 Sean Christopherson <seanjc@google.com>

KVM: PPC: Convert to the gfn-based MMU notifier callbacks

Move PPC to the gfn-base MMU notifier APIs, and update all 15 bajillion
PPC-internal hooks to work with gfns instead of hvas.

No meaningful functional change intended, though the exact order of
operations is slightly different since the memslot lookups occur before
calling into arch code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 4e1b2ab7 10-Sep-2020 Greg Kurz <groug@kaod.org>

KVM: PPC: Don't return -ENOTSUPP to userspace in ioctls

ENOTSUPP is a linux only thingy, the value of which is unknown to
userspace, not to be confused with ENOTSUP which linux maps to
EOPNOTSUPP, as permitted by POSIX [1]:

[EOPNOTSUPP]
Operation not supported on socket. The type of socket (address family
or protocol) does not support the requested operation. A conforming
implementation may assign the same values for [EOPNOTSUPP] and [ENOTSUP].

Return -EOPNOTSUPP instead of -ENOTSUPP for the following ioctls:
- KVM_GET_FPU for Book3s and BookE
- KVM_SET_FPU for Book3s and BookE
- KVM_GET_DIRTY_LOG for BookE

This doesn't affect QEMU which doesn't call the KVM_GET_FPU and
KVM_SET_FPU ioctls on POWER anyway since they are not supported,
and _buggily_ ignores anything but -EPERM for KVM_GET_DIRTY_LOG.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html

Signed-off-by: Greg Kurz <groug@kaod.org>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 5706d14d 09-Aug-2020 Greg Kurz <groug@kaod.org>

KVM: PPC: Book3S HV: XICS: Replace the 'destroy' method by a 'release' method

Similarly to what was done with XICS-on-XIVE and XIVE native KVM devices
with commit 5422e95103cf ("KVM: PPC: Book3S HV: XIVE: Replace the 'destroy'
method by a 'release' method"), convert the historical XICS KVM device to
implement the 'release' method. This is needed to run nested guests with
an in-kernel IRQ chip. A typical POWER9 guest can select XICS or XIVE
during boot, which requires to be able to destroy and to re-create the
KVM device. Only the historical XICS KVM device is available under pseries
at the current time and it still uses the legacy 'destroy' method.

Switching to 'release' means that vCPUs might still be running when the
device is destroyed. In order to avoid potential use-after-free, the
kvmppc_xics structure is allocated on first usage and kept around until
the VM exits. The same pointer is used each time a KVM XICS device is
being created, but this is okay since we only have one per VM.

Clear the ICP of each vCPU with vcpu->mutex held. This ensures that the
next time the vCPU resumes execution, it won't be going into the XICS
code anymore.

Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Tested-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# fdfe7cbd 11-Aug-2020 Will Deacon <will@kernel.org>

KVM: Pass MMU notifier range flags to kvm_unmap_hva_range()

The 'flags' field of 'struct mmu_notifier_range' is used to indicate
whether invalidate_range_{start,end}() are permitted to block. In the
case of kvm_mmu_notifier_invalidate_range_start(), this field is not
forwarded on to the architecture-specific implementation of
kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
whether or not to block.

Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
architectures are aware as to whether or not they are permitted to block.

Cc: <stable@vger.kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20200811102725.7121-2-will@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 8c99d345 26-Apr-2020 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>

KVM: PPC: Clean up redundant 'kvm_run' parameters

In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function parameters
retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
patch does a unified cleanup of these remaining redundant parameters.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 812756a8 14-Apr-2020 Emanuele Giuseppe Esposito <eesposit@redhat.com>

kvm_host: unify VM_STAT and VCPU_STAT definitions in a single place

The macros VM_STAT and VCPU_STAT are redundantly implemented in multiple
files, each used by a different architecure to initialize the debugfs
entries for statistics. Since they all have the same purpose, they can be
unified in a single common definition in include/linux/kvm_host.h

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20200414155625.20559-1-eesposit@redhat.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 6fef0c6b 18-Mar-2020 Greg Kurz <groug@kaod.org>

KVM: PPC: Kill kvmppc_ops::mmu_destroy() and kvmppc_mmu_destroy()

These are only used by HV KVM and BookE, and in both cases they are
nops.

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 0dff0846 18-Feb-2020 Sean Christopherson <seanjc@google.com>

KVM: Provide common implementation for generic dirty log functions

Move the implementations of KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG
for CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT into common KVM code.
The arch specific implemenations are extremely similar, differing
only in whether the dirty log needs to be sync'd from hardware (x86)
and how the TLBs are flushed. Add new arch hooks to handle sync
and TLB flush; the sync will also be used for non-generic dirty log
support in a future patch (s390).

The ulterior motive for providing a common implementation is to
eliminate the dependency between arch and common code with respect to
the memslot referenced by the dirty log, i.e. to make it obvious in the
code that the validity of the memslot is guaranteed, as a future patch
will rework memslot handling such that id_to_memslot() can return NULL.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# e96c81ee 18-Feb-2020 Sean Christopherson <seanjc@google.com>

KVM: Simplify kvm_free_memslot() and all its descendents

Now that all callers of kvm_free_memslot() pass NULL for @dont, remove
the param from the top-level routine and all arch's implementations.

No functional change intended.

Tested-by: Christoffer Dall <christoffer.dall@arm.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 82307e67 18-Feb-2020 Sean Christopherson <seanjc@google.com>

KVM: PPC: Move memslot memory allocation into prepare_memory_region()

Allocate the rmap array during kvm_arch_prepare_memory_region() to pave
the way for removing kvm_arch_create_memslot() altogether. Moving PPC's
memory allocation only changes the order of kernel memory allocations
between PPC and common KVM code.

No functional change intended.

Acked-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# afede96d 18-Dec-2019 Sean Christopherson <seanjc@google.com>

KVM: Drop kvm_arch_vcpu_setup()

Remove kvm_arch_vcpu_setup() now that all arch specific implementations
are nops.

Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# ff030fdf 18-Dec-2019 Sean Christopherson <seanjc@google.com>

KVM: PPC: Move kvm_vcpu_init() invocation to common code

Move the kvm_cpu_{un}init() calls to common PPC code as an intermediate
step towards removing kvm_cpu_{un}init() altogether.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# c50bfbdc 18-Dec-2019 Sean Christopherson <seanjc@google.com>

KVM: PPC: Allocate vcpu struct in common PPC code

Move allocation of all flavors of PPC vCPUs to common PPC code. All
variants either allocate 'struct kvm_vcpu' directly, or require that
the embedded 'struct kvm_vcpu' member be located at offset 0, i.e.
guarantee that the allocation can be directly interpreted as a 'struct
kvm_vcpu' object.

Remove the message from the build-time assertion regarding placement of
the struct, as compatibility with the arch usercopy region is no longer
the sole dependent on 'struct kvm_vcpu' being at offset zero.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 87a45e07 02-Oct-2019 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S: Replace reset_msr mmu op with inject_interrupt arch op

reset_msr sets the MSR for interrupt injection, but it's cleaner and
more flexible to provide a single op to set both MSR and PC for the
interrupt.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 9ee6471e 02-Oct-2019 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S: Define and use SRR1_MSR_BITS

Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 833b45de 30-Sep-2019 Paolo Bonzini <pbonzini@redhat.com>

kvm: x86, powerpc: do not allow clearing largepages debugfs entry

The largepages debugfs entry is incremented/decremented as shadow
pages are created or destroyed. Clearing it will result in an
underflow, which is harmless to KVM but ugly (and could be
misinterpreted by tools that use debugfs information), so make
this particular statistic read-only.

Cc: kvm-ppc@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 2ad7a27d 26-Aug-2019 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S: Enable XIVE native capability only if OPAL has required functions

There are some POWER9 machines where the OPAL firmware does not support
the OPAL_XIVE_GET_QUEUE_STATE and OPAL_XIVE_SET_QUEUE_STATE calls.
The impact of this is that a guest using XIVE natively will not be able
to be migrated successfully. On the source side, the get_attr operation
on the KVM native device for the KVM_DEV_XIVE_GRP_EQ_CONFIG attribute
will fail; on the destination side, the set_attr operation for the same
attribute will fail.

This adds tests for the existence of the OPAL get/set queue state
functions, and if they are not supported, the XIVE-native KVM device
is not created and the KVM_CAP_PPC_IRQ_XIVE capability returns false.
Userspace can then either provide a software emulation of XIVE, or
else tell the guest that it does not have a XIVE controller available
to it.

Cc: stable@vger.kernel.org # v5.2+
Fixes: 3fab2d10588e ("KVM: PPC: Book3S HV: XIVE: Activate XIVE exploitation mode")
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# d2912cb1 04-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500

Based on 2 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 1659e27d 28-May-2019 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S: Use new mutex to synchronize access to rtas token list

Currently the Book 3S KVM code uses kvm->lock to synchronize access
to the kvm->arch.rtas_tokens list. Because this list is scanned
inside kvmppc_rtas_hcall(), which is called with the vcpu mutex held,
taking kvm->lock cause a lock inversion problem, which could lead to
a deadlock.

To fix this, we add a new mutex, kvm->arch.rtas_token_lock, which nests
inside the vcpu mutexes, and use that instead of kvm->lock when
accessing the rtas token list.

This removes the lockdep_assert_held() in kvmppc_rtas_tokens_free().
At this point we don't hold the new mutex, but that is OK because
kvmppc_rtas_tokens_free() is only called when the whole VM is being
destroyed, and at that point nothing can be looking up a token in
the list.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 5422e951 17-Apr-2019 Cédric Le Goater <clg@kaod.org>

KVM: PPC: Book3S HV: XIVE: Replace the 'destroy' method by a 'release' method

When a P9 sPAPR VM boots, the CAS negotiation process determines which
interrupt mode to use (XICS legacy or XIVE native) and invokes a
machine reset to activate the chosen mode.

We introduce 'release' methods for the XICS-on-XIVE and the XIVE
native KVM devices which are called when the file descriptor of the
device is closed after the TIMA and ESB pages have been unmapped.
They perform the necessary cleanups : clear the vCPU interrupt
presenters that could be attached and then destroy the device. The
'release' methods replace the 'destroy' methods as 'destroy' is not
called anymore once 'release' is. Compatibility with older QEMU is
nevertheless maintained.

This is not considered as a safe operation as the vCPUs are still
running and could be referencing the KVM device through their
presenters. To protect the system from any breakage, the kvmppc_xive
objects representing both KVM devices are now stored in an array under
the VM. Allocation is performed on first usage and memory is freed
only when the VM exits.

[paulus@ozlabs.org - Moved freeing of xive structures to book3s.c,
put it under #ifdef CONFIG_KVM_XICS.]

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# e4945b9d 17-Apr-2019 Cédric Le Goater <clg@kaod.org>

KVM: PPC: Book3S HV: XIVE: Add get/set accessors for the VP XIVE state

The state of the thread interrupt management registers needs to be
collected for migration. These registers are cached under the
'xive_saved_state.w01' field of the VCPU when the VPCU context is
pulled from the HW thread. An OPAL call retrieves the backup of the
IPB register in the underlying XIVE NVT structure and merges it in the
KVM state.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 90c73795 17-Apr-2019 Cédric Le Goater <clg@kaod.org>

KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode

This is the basic framework for the new KVM device supporting the XIVE
native exploitation mode. The user interface exposes a new KVM device
to be created by QEMU, only available when running on a L0 hypervisor.
Support for nested guests is not available yet.

The XIVE device reuses the device structure of the XICS-on-XIVE device
as they have a lot in common. That could possibly change in the future
if the need arise.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 884dfb72 20-Feb-2019 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Simplify machine check handling

This makes the handling of machine check interrupts that occur inside
a guest simpler and more robust, with less done in assembler code and
in real mode.

Now, when a machine check occurs inside a guest, we always get the
machine check event struct and put a copy in the vcpu struct for the
vcpu where the machine check occurred. We no longer call
machine_check_queue_event() from kvmppc_realmode_mc_power7(), because
on POWER8, when a vcpu is running on an offline secondary thread and
we call machine_check_queue_event(), that calls irq_work_queue(),
which doesn't work because the CPU is offline, but instead triggers
the WARN_ON(lazy_irq_pending()) in pnv_smp_cpu_kill_self() (which
fires again and again because nothing clears the condition).

All that machine_check_queue_event() actually does is to cause the
event to be printed to the console. For a machine check occurring in
the guest, we now print the event in kvmppc_handle_exit_hv()
instead.

The assembly code at label machine_check_realmode now just calls C
code and then continues exiting the guest. We no longer either
synthesize a machine check for the guest in assembly code or return
to the guest without a machine check.

The code in kvmppc_handle_exit_hv() is extended to handle the case
where the guest is not FWNMI-capable. In that case we now always
synthesize a machine check interrupt for the guest. Previously, if
the host thinks it has recovered the machine check fully, it would
return to the guest without any notification that the machine check
had occurred. If the machine check was caused by some action of the
guest (such as creating duplicate SLB entries), it is much better to
tell the guest that it has caused a problem. Therefore we now always
generate a machine check interrupt for guests that are not
FWNMI-capable.

Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 8f1f7b9b 18-Feb-2019 Suraj Jitindar Singh <sjitindarsingh@gmail.com>

KVM: PPC: Book3S HV: Add KVM stat largepages_[2M/1G]

This adds an entry to the kvm_stats_debugfs directory which provides the
number of large (2M or 1G) pages which have been used to setup the guest
mappings, for radix guests.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 03f95332 04-Feb-2019 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVE

Currently, the KVM code assumes that if the host kernel is using the
XIVE interrupt controller (the new interrupt controller that first
appeared in POWER9 systems), then the in-kernel XICS emulation will
use the XIVE hardware to deliver interrupts to the guest. However,
this only works when the host is running in hypervisor mode and has
full access to all of the XIVE functionality. It doesn't work in any
nested virtualization scenario, either with PR KVM or nested-HV KVM,
because the XICS-on-XIVE code calls directly into the native-XIVE
routines, which are not initialized and cannot function correctly
because they use OPAL calls, and OPAL is not available in a guest.

This means that using the in-kernel XICS emulation in a nested
hypervisor that is using XIVE as its interrupt controller will cause a
(nested) host kernel crash. To fix this, we change most of the places
where the current code calls xive_enabled() to select between the
XICS-on-XIVE emulation and the plain XICS emulation to call a new
function, xics_on_xive(), which returns false in a guest.

However, there is a further twist. The plain XICS emulation has some
functions which are used in real mode and access the underlying XICS
controller (the interrupt controller of the host) directly. In the
case of a nested hypervisor, this means doing XICS hypercalls
directly. When the nested host is using XIVE as its interrupt
controller, these hypercalls will fail. Therefore this also adds
checks in the places where the XICS emulation wants to access the
underlying interrupt controller directly, and if that is XIVE, makes
the code use the virtual mode fallback paths, which call generic
kernel infrastructure rather than doing direct XICS access.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 748c0e31 06-Dec-2018 Lan Tianyu <Tianyu.Lan@microsoft.com>

KVM: Make kvm_set_spte_hva() return int

The patch is to make kvm_set_spte_hva() return int and caller can
check return value to determine flush tlb or not.

Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# f032b734 11-Dec-2018 Bharata B Rao <bharata@linux.ibm.com>

KVM: PPC: Pass change type down to memslot commit function

Currently, kvm_arch_commit_memory_region() gets called with a
parameter indicating what type of change is being made to the memslot,
but it doesn't pass it down to the platform-specific memslot commit
functions. This adds the `change' parameter to the lower-level
functions so that they can use it in future.

[paulus@ozlabs.org - fix book E also.]

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# d24ea8a7 07-Oct-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S: Simplify external interrupt handling

Currently we use two bits in the vcpu pending_exceptions bitmap to
indicate that an external interrupt is pending for the guest, one
for "one-shot" interrupts that are cleared when delivered, and one
for interrupts that persist until cleared by an explicit action of
the OS (e.g. an acknowledge to an interrupt controller). The
BOOK3S_IRQPRIO_EXTERNAL bit is used for one-shot interrupt requests
and BOOK3S_IRQPRIO_EXTERNAL_LEVEL is used for persisting interrupts.

In practice BOOK3S_IRQPRIO_EXTERNAL never gets used, because our
Book3S platforms generally, and pseries in particular, expect
external interrupt requests to persist until they are acknowledged
at the interrupt controller. That combined with the confusion
introduced by having two bits for what is essentially the same thing
makes it attractive to simplify things by only using one bit. This
patch does that.

With this patch there is only BOOK3S_IRQPRIO_EXTERNAL, and by default
it has the semantics of a persisting interrupt. In order to avoid
breaking the ABI, we introduce a new "external_oneshot" flag which
preserves the behaviour of the KVM_INTERRUPT ioctl with the
KVM_INTERRUPT_SET argument.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 1006284c 31-Jul-2018 Cameron Kaiser <spectre@floodgap.com>

KVM: PPC: Book3S PR: Exiting split hack mode needs to fixup both PC and LR

When an OS (currently only classic Mac OS) is running in KVM-PR and makes a
linked jump from code with split hack addressing enabled into code that does
not, LR is not correctly updated and reflects the previously munged PC.

To fix this, this patch undoes the address munge when exiting split
hack mode so that code relying on LR being a proper address will now
execute. This does not affect OS X or other operating systems running
on KVM-PR.

Signed-off-by: Cameron Kaiser <spectre@floodgap.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 45ef5992 05-Jul-2018 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc: remove unnecessary inclusion of asm/tlbflush.h

asm/tlbflush.h is only needed for:
- using functions xxx_flush_tlb_xxx()
- using MMU_NO_CONTEXT
- including asm-generic/pgtable.h

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 916ccadc 07-Jun-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S PR: Fix MSR setting when delivering interrupts

This makes sure that MSR "partial-function" bits are not transferred
to SRR1 when delivering an interrupt. This was causing failures in
guests running kernels that include commit f3d96e698ed0 ("powerpc/mm:
Overhaul handling of bad page faults", 2017-07-19), which added code
to check bits of SRR1 on instruction storage interrupts (ISIs) that
indicate a bad page fault. The symptom was that a guest user program
that handled a signal and attempted to return from the signal handler
would get a SIGBUS signal and die.

The code that generated ISIs and some other interrupts would
previously set bits in the guest MSR to indicate the interrupt status
and then call kvmppc_book3s_queue_irqprio(). This technique no
longer works now that kvmppc_inject_interrupt() is masking off those
bits. Instead we make kvmppc_core_queue_data_storage() and
kvmppc_core_queue_inst_storage() call kvmppc_inject_interrupt()
directly, and make sure that all the places that generate ISIs or
DSIs call kvmppc_core_queue_{data,inst}_storage instead of
kvmppc_book3s_queue_irqprio().

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 8f044126 23-May-2018 Simon Guo <wei.guo.simon@gmail.com>

KVM: PPC: Book3S: Remove load/put vcpu for KVM_GET_REGS/KVM_SET_REGS

In both HV and PR KVM, the KVM_SET_REGS/KVM_GET_REGS ioctl should
be able to perform without the vcpu loaded.

Since the vcpu mutex locking/unlock has been moved out of vcpu_load()
/vcpu_put(), KVM_SET_REGS/KVM_GET_REGS don't need to do ioctl with
the vcpu loaded anymore. This patch removes vcpu_load()/vcpu_put()
from KVM_SET_REGS/KVM_GET_REGS ioctl.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 70923603 20-May-2018 Simon Guo <wei.guo.simon@gmail.com>

KVM: PPC: Reimplement non-SIMD LOAD/STORE instruction mmio emulation with analyse_instr() input

This patch reimplements non-SIMD LOAD/STORE instruction MMIO emulation
with analyse_instr() input. It utilizes the BYTEREV/UPDATE/SIGNEXT
properties exported by analyse_instr() and invokes
kvmppc_handle_load(s)/kvmppc_handle_store() accordingly.

It also moves CACHEOP type handling into the skeleton.

instruction_type within kvm_ppc.h is renamed to avoid conflict with
sstep.h.

Suggested-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 39c983ea 21-Feb-2018 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Remove unused kvm_unmap_hva callback

Since commit fb1522e099f0 ("KVM: update to new mmu_notifier semantic
v2", 2017-08-31), the MMU notifier code in KVM no longer calls the
kvm_unmap_hva callback. This removes the PPC implementations of
kvm_unmap_hva().

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 66b56562 04-Dec-2017 Christoffer Dall <christoffer.dall@linaro.org>

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_guest_debug

Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_guest_debug().

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b4ef9d4e 04-Dec-2017 Christoffer Dall <christoffer.dall@linaro.org>

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_sregs

Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_sregs().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# bcdec41c 04-Dec-2017 Christoffer Dall <christoffer.dall@linaro.org>

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_get_sregs

Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_get_sregs().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 875656fe 04-Dec-2017 Christoffer Dall <christoffer.dall@linaro.org>

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_regs

Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_regs().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 1fc9b76b 04-Dec-2017 Christoffer Dall <christoffer.dall@linaro.org>

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_get_regs

Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_get_regs().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 5af50993 05-Apr-2017 Benjamin Herrenschmidt <benh@kernel.crashing.org>

KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller

This patch makes KVM capable of using the XIVE interrupt controller
to provide the standard PAPR "XICS" style hypercalls. It is necessary
for proper operations when the host uses XIVE natively.

This has been lightly tested on an actual system, including PCI
pass-through with a TG3 device.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Cleanup pr_xxx(), unsplit pr_xxx() strings, etc., fix build
failures by adding KVM_XIVE which depends on KVM_XICS and XIVE, and
adding empty stubs for the kvm_xive_xxx() routines, fixup subject,
integrate fixes from Paul for building PR=y HV=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 307d9279 22-Mar-2017 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Provide functions for queueing up FP/VEC/VSX unavailable interrupts

This provides functions that can be used for generating interrupts
indicating that a given functional unit (floating point, vector, or
VSX) is unavailable. These functions will be used in instruction
emulation code.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# d3989143 05-Apr-2017 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/kvm: Massage order of #include

We traditionally have linux/ before asm/

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 5a319350 30-Jan-2017 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Page table construction and page faults for radix guests

This adds the code to construct the second-level ("partition-scoped" in
architecturese) page tables for guests using the radix MMU. Apart from
the PGD level, which is allocated when the guest is created, the rest
of the tree is all constructed in response to hypervisor page faults.

As well as hypervisor page faults for missing pages, we also get faults
for reference/change (RC) bits needing to be set, as well as various
other error conditions. For now, we only set the R or C bit in the
guest page table if the same bit is set in the host PTE for the
backing page.

This code can take advantage of the guest being backed with either
transparent or ordinary 2MB huge pages, and insert 2MB page entries
into the guest page tables. There is no support for 1GB huge pages
yet.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 7c0f6ba6 24-Dec-2016 Linus Torvalds <torvalds@linux-foundation.org>

Replace <asm/uaccess.h> with <linux/uaccess.h> globally

This was entirely automated, using the script by Al:

PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 88b02cf9 14-Sep-2016 Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread

POWER8 has one virtual timebase (VTB) register per subcore, not one
per CPU thread. The HV KVM code currently treats VTB as a per-thread
register, which can lead to spurious soft lockup messages from guests
which use the VTB as the time source for the soft lockup detector.
(CPUs before POWER8 did not have the VTB register.)

For HV KVM, this fixes the problem by making only the primary thread
in each virtual core save and restore the VTB value. With this,
the VTB state becomes part of the kvmppc_vcore structure. This
also means that "piggybacking" of multiple virtual cores onto one
subcore is not possible on POWER8, because then the virtual cores
would share a single VTB register.

PR KVM emulates a VTB register, which is per-vcpu because PR KVM
has no notion of CPU threads or SMT. For PR KVM we move the VTB
state into the kvmppc_vcpu_book3s struct.

Cc: stable@vger.kernel.org # v3.14+
Reported-by: Thomas Huth <thuth@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 65e7026a 18-Aug-2016 Suresh Warrier <warrier@linux.vnet.ibm.com>

KVM: PPC: Book3S HV: Counters for passthrough IRQ stats

Add VCPU stat counters to track affinity for passthrough
interrupts.

pthru_all: Counts all passthrough interrupts whose IRQ mappings are
in the kvmppc_passthru_irq_map structure.
pthru_host: Counts all cached passthrough interrupts that were injected
from the host through kvm_set_irq (i.e. not handled in
real mode).
pthru_bad_aff: Counts how many cached passthrough interrupts have
bad affinity (receiving CPU is not running VCPU that is
the target of the virtual interrupt in the guest).

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 2a27f514 01-Aug-2016 Suraj Jitindar Singh <sjitindarsingh@gmail.com>

KVM: PPC: Implement existing and add new halt polling vcpu stats

vcpu stats are used to collect information about a vcpu which can be viewed
in the debugfs. For example halt_attempted_poll and halt_successful_poll
are used to keep track of the number of times the vcpu attempts to and
successfully polls. These stats are currently not used on powerpc.

Implement incrementation of the halt_attempted_poll and
halt_successful_poll vcpu stats for powerpc. Since these stats are summed
over all the vcpus for all running guests it doesn't matter which vcpu
they are attributed to, thus we choose the current runner vcpu of the
vcore.

Also add new vcpu stats: halt_poll_success_ns, halt_poll_fail_ns and
halt_wait_ns to be used to accumulate the total time spend polling
successfully, polling unsuccessfully and waiting respectively, and
halt_successful_wait to accumulate the number of times the vcpu waits.
Given that halt_poll_success_ns, halt_poll_fail_ns and halt_wait_ns are
expressed in nanoseconds it is necessary to represent these as 64-bit
quantities, otherwise they would overflow after only about 4 seconds.

Given that the total time spend either polling or waiting will be known and
the number of times that each was done, it will be possible to determine
the average poll and wait times. This will give the ability to tune the kvm
module parameters based on the calculated average wait and poll times.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>


# 3491caf2 12-May-2016 Christian Borntraeger <borntraeger@de.ibm.com>

KVM: halt_polling: provide a way to qualify wakeups during poll

Some wakeups should not be considered a sucessful poll. For example on
s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
would be considered runnable - letting all vCPUs poll all the time for
transactional like workload, even if one vCPU would be enough.
This can result in huge CPU usage for large guests.
This patch lets architectures provide a way to qualify wakeups if they
should be considered a good/bad wakeups in regard to polls.

For s390 the implementation will fence of halt polling for anything but
known good, single vCPU events. The s390 implementation for floating
interrupts does a wakeup for one vCPU, but the interrupt will be delivered
by whatever CPU checks first for a pending interrupt. We prefer the
woken up CPU by marking the poll of this CPU as "good" poll.
This code will also mark several other wakeup reasons like IPI or
expired timers as "good". This will of course also mark some events as
not sucessful. As KVM on z runs always as a 2nd level hypervisor,
we prefer to not poll, unless we are really sure, though.

This patch successfully limits the CPU usage for cases like uperf 1byte
transactional ping pong workload or wakeup heavy workload like OLTP
while still providing a proper speedup.

This also introduced a new vcpu stat "halt_poll_no_tuning" that marks
wakeups that are considered not good for polling.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
Cc: David Matlack <dmatlack@google.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
[Rename config symbol. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 366baf28 14-Feb-2016 Alexey Kardashevskiy <aik@ozlabs.ru>

KVM: PPC: Use RCU for arch.spapr_tce_tables

At the moment only spapr_tce_tables updates are protected against races
but not lookups. This fixes missing protection by using RCU for the list.
As lookups also happen in real mode, this uses
list_for_each_entry_lockless() (which is expected not to access any
vmalloc'd memory).

This converts release_spapr_tce_table() to a RCU scheduled handler.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# ba049e93 15-Jan-2016 Dan Williams <dan.j.williams@intel.com>

kvm: rename pfn_t to kvm_pfn_t

To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.

The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.

The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.

Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.

This patch (of 18):

The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3eb4ee68 18-Sep-2015 Thomas Huth <thuth@redhat.com>

KVM: PPC: Book3S: Take the kvm->srcu lock in kvmppc_h_logical_ci_load/store()

Access to the kvm->buses (like with the kvm_io_bus_read() and -write()
functions) has to be protected via the kvm->srcu lock.
The kvmppc_h_logical_ci_load() and -store() functions are missing
this lock so far, so let's add it there, too.
This fixes the problem that the kernel reports "suspicious RCU usage"
when lock debugging is enabled.

Cc: stable@vger.kernel.org # v4.1+
Fixes: 99342cf8044420eebdf9297ca03a14cb6a7085a1
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 62bea5bf 15-Sep-2015 Paolo Bonzini <pbonzini@redhat.com>

KVM: add halt_attempted_poll to VCPU stats

This new statistic can help diagnosing VCPUs that, for any reason,
trigger bad behavior of halt_poll_ns autotuning.

For example, say halt_poll_ns = 480000, and wakeups are spaced exactly
like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
10+20+40+80+160+320+480 = 1110 microseconds out of every
479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
is consuming about 30% more CPU than it would use without
polling. This would show as an abnormally high number of
attempted polling compared to the successful polls.

Acked-by: Christian Borntraeger <borntraeger@de.ibm.com<
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 5358a963 22-May-2015 Thomas Huth <thuth@redhat.com>

KVM: PPC: Fix warnings from sparse

When compiling the KVM code for POWER with "make C=1", sparse
complains about functions missing proper prototypes and a 64-bit
constant missing the ULL prefix. Let's fix this by making the
functions static or by including the proper header with the
prototypes, and by appending a ULL prefix to the constant
PPC_MPPE_ADDRESS_MASK.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 60acc4eb 27-May-2015 Laurent Pinchart <laurent.pinchart@ideasonboard.com>

treewide: Fix typo compatability -> compatibility

Even though 'compatability' has a dedicated entry in the Wiktionary,
it's listed as 'Mispelling of compatibility'. Fix it.

Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> for the atomic_helper.c
Signed-off-by: Jiri Kosina <jkosina@suse.com>


# f36f3f28 18-May-2015 Paolo Bonzini <pbonzini@redhat.com>

KVM: add "new" argument to kvm_arch_commit_memory_region

This lets the function access the new memory slot without going through
kvm_memslots and id_to_memslot. It will simplify the code when more
than one address space will be supported.

Unfortunately, the "const"ness of the new argument must be casted
away in two places. Fixing KVM to accept const struct kvm_memory_slot
pointers would require modifications in pretty much all architectures,
and is left for later.

Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 09170a49 18-May-2015 Paolo Bonzini <pbonzini@redhat.com>

KVM: const-ify uses of struct kvm_userspace_memory_region

Architecture-specific helpers are not supposed to muck with
struct kvm_userspace_memory_region contents. Add const to
enforce this.

In order to eliminate the only write in __kvm_set_memory_region,
the cleaning of deleted slots is pulled up from update_memslots
to __kvm_set_memory_region.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 99342cf8 04-Feb-2015 David Gibson <david@gibson.dropbear.id.au>

kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM

On POWER, storage caching is usually configured via the MMU - attributes
such as cache-inhibited are stored in the TLB and the hashed page table.

This makes correctly performing cache inhibited IO accesses awkward when
the MMU is turned off (real mode). Some CPU models provide special
registers to control the cache attributes of real mode load and stores but
this is not at all consistent. This is a problem in particular for SLOF,
the firmware used on KVM guests, which runs entirely in real mode, but
which needs to do IO to load the kernel.

To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD
and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to
a logical address (aka guest physical address). SLOF uses these for IO.

However, because these are implemented within qemu, not the host kernel,
these bypass any IO devices emulated within KVM itself. The simplest way
to see this problem is to attempt to boot a KVM guest from a virtio-blk
device with iothread / dataplane enabled. The iothread code relies on an
in kernel implementation of the virtio queue notification, which is not
triggered by the IO hcalls, and so the guest will stall in SLOF unable to
load the guest OS.

This patch addresses this by providing in-kernel implementations of the
2 hypercalls, which correctly scan the KVM IO bus. Any access to an
address not handled by the KVM IO bus will cause a VM exit, hitting the
qemu implementation as before.

Note that a userspace change is also required, in order to enable these
new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[agraf: fix compilation]
Signed-off-by: Alexander Graf <agraf@suse.de>


# f7819512 04-Feb-2015 Paolo Bonzini <pbonzini@redhat.com>

kvm: add halt_poll_ns module parameter

This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.

This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.

With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.

Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.

The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.

While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.

The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 54ca162a 07-Dec-2014 Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>

arch: powerpc: kvm: book3s.c: Remove some unused functions

Removes some functions that are not used anywhere:
kvmppc_core_load_guest_debugstate() kvmppc_core_load_host_debugstate()

This was partially found by using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 57128468 22-Sep-2014 Andres Lagar-Cavilla <andreslc@google.com>

kvm: Fix page ageing bugs

1. We were calling clear_flush_young_notify in unmap_one, but we are
within an mmu notifier invalidate range scope. The spte exists no more
(due to range_start) and the accessed bit info has already been
propagated (due to kvm_pfn_set_accessed). Simply call
clear_flush_young.

2. We clear_flush_young on a primary MMU PMD, but this may be mapped
as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
This required expanding the interface of the clear_flush_young mmu
notifier, so a lot of code has been trivially touched.

3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
the access bit by blowing the spte. This requires proper synchronizing
with MMU notifier consumers, like every other removal of spte's does.

Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# a59c1d9e 09-Sep-2014 Madhavan Srinivasan <maddy@linux.vnet.ibm.com>

powerpc/kvm: support to handle sw breakpoint

This patch adds kernel side support for software breakpoint.
Design is that, by using an illegal instruction, we trap to hypervisor
via Emulation Assistance interrupt, where we check for the illegal instruction
and accordingly we return to Host or Guest. Patch also adds support for
software breakpoint in PR KVM.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# d02d4d15 01-Sep-2014 Mihai Caraman <mihai.caraman@freescale.com>

KVM: PPC: Remove the tasklet used by the hrtimer

Powerpc timer implementation is a copycat version of s390. Now that they removed
the tasklet with commit ea74c0ea1b24a6978a6ebc80ba4dbc7b7848b32d follow this
optimization.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 3840edc8 20-Aug-2014 Mihai Caraman <mihai.caraman@freescale.com>

KVM: PPC: Move ONE_REG AltiVec support to powerpc

Move ONE_REG AltiVec support to powerpc generic layer.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 8a41ea53 20-Aug-2014 Mihai Caraman <mihai.caraman@freescale.com>

KVM: PPC: Make ONE_REG powerpc generic

Make ONE_REG generic for server and embedded architectures by moving
kvm_vcpu_ioctl_get_one_reg() and kvm_vcpu_ioctl_set_one_reg() functions
to powerpc layer.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 8de12015 18-Jun-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Expose helper functions for data/inst faults

We're going to implement guest code interpretation in KVM for some rare
corner cases. This code needs to be able to inject data and instruction
faults into the guest when it encounters them.

Expose generic APIs to do this in a reasonably subarch agnostic fashion.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 35c4a733 20-Jun-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Move kvmppc_ld/st to common code

We have enough common infrastructure now to resolve GVA->GPA mappings at
runtime. With this we can move our book3s specific helpers to load / store
in guest virtual address space to common code as well.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 7d15c06f 20-Jun-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Implement kvmppc_xlate for all targets

We have a nice API to find the translated GPAs of a GVA including protection
flags. So far we only use it on Book3S, but there's no reason the same shouldn't
be used on BookE as well.

Implement a kvmppc_xlate() version for BookE and clean it up to make it more
readable in general.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 1b2e33b0 19-Jul-2014 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S: Make kvmppc_ld return a more accurate error indication

At present, kvmppc_ld calls kvmppc_xlate, and if kvmppc_xlate returns
any error indication, it returns -ENOENT, which is taken to mean an
HPTE not found error. However, the error could have been a segment
found (no SLB entry) or a permission error. Similarly,
kvmppc_pte_to_hva currently does permission checking, but any error
from it is taken by kvmppc_ld to mean that the access is an emulated
MMIO access. Also, kvmppc_ld does no execute permission checking.

This fixes these problems by (a) returning any error from kvmppc_xlate
directly, (b) moving the permission check from kvmppc_pte_to_hva
into kvmppc_ld, and (c) adding an execute permission check to kvmppc_ld.

This is similar to what was done for kvmppc_st() by commit 82ff911317c3
("KVM: PPC: Deflect page write faults properly in kvmppc_st").

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 51f04726 23-Jul-2014 Mihai Caraman <mihai.caraman@freescale.com>

KVM: PPC: Allow kvmppc_get_last_inst() to fail

On book3e, guest last instruction is read on the exit path using load
external pid (lwepx) dedicated instruction. This load operation may fail
due to TLB eviction and execute-but-not-read entries.

This patch lay down the path for an alternative solution to read the guest
last instruction, by allowing kvmppc_get_lat_inst() function to fail.
Architecture specific implmentations of kvmppc_load_last_inst() may read
last guest instruction and instruct the emulation layer to re-execute the
guest in case of failure.

Make kvmppc_get_last_inst() definition common between architectures.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 89b68c96 13-Jul-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Book3S: Make magic page properly 4k mappable

The magic page is defined as a 4k page of per-vCPU data that is shared
between the guest and the host to accelerate accesses to privileged
registers.

However, when the host is using 64k page size granularity we weren't quite
as strict about that rule anymore. Instead, we partially treated all of the
upper 64k as magic page and mapped only the uppermost 4k with the actual
magic contents.

This works well enough for Linux which doesn't use any memory in kernel
space in the upper 64k, but Mac OS X got upset. So this patch makes magic
page actually stay in a 4k range even on 64k page size hosts.

This patch fixes magic page usage with Mac OS X (using MOL) on 64k PAGE_SIZE
hosts for me.

Signed-off-by: Alexander Graf <agraf@suse.de>


# c01e3f66 10-Jul-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Book3S: Add hack for split real mode

Today we handle split real mode by mapping both instruction and data faults
into a special virtual address space that only exists during the split mode
phase.

This is good enough to catch 32bit Linux guests that use split real mode for
copy_from/to_user. In this case we're always prefixed with 0xc0000000 for our
instruction pointer and can map the user space process freely below there.

However, that approach fails when we're running KVM inside of KVM. Here the 1st
level last_inst reader may well be in the same virtual page as a 2nd level
interrupt handler.

It also fails when running Mac OS X guests. Here we have a 4G/4G split, so a
kernel copy_from/to_user implementation can easily overlap with user space
addresses.

The architecturally correct way to fix this would be to implement an instruction
interpreter in KVM that kicks in whenever we go into split real mode. This
interpreter however would not receive a great amount of testing and be a lot of
bloat for a reasonably isolated corner case.

So I went back to the drawing board and tried to come up with a way to make
split real mode work with a single flat address space. And then I realized that
we could get away with the same trick that makes it work for Linux:

Whenever we see an instruction address during split real mode that may collide,
we just move it higher up the virtual address space to a place that hopefully
does not collide (keep your fingers crossed!).

That approach does work surprisingly well. I am able to successfully run
Mac OS X guests with KVM and QEMU (no split real mode hacks like MOL) when I
apply a tiny timing probe hack to QEMU. I'd say this is a win over even more
broken split real mode :).

Signed-off-by: Alexander Graf <agraf@suse.de>


# 17824b5a 10-Jul-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Deflect page write faults properly in kvmppc_st

When we have a page that we're not allowed to write to, xlate() will already
tell us -EPERM on lookup of that page. With the code as is we change it into
a "page missing" error which a guest may get confused about. Instead, just
tell the caller about the -EPERM directly.

This fixes Mac OS X guests when run with DCBZ32 emulation.

Signed-off-by: Alexander Graf <agraf@suse.de>


# ae2113a4 01-Jun-2014 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled

This adds code to check that when the KVM_CAP_PPC_ENABLE_HCALL
capability is used to enable or disable in-kernel handling of an
hcall, that the hcall is actually implemented by the kernel.
If not an EINVAL error is returned.

This also checks the default-enabled list of hcalls and prints a
warning if any hcall there is not actually implemented.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 06da28e7 05-Jun-2014 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

KVM: PPC: BOOK3S: PR: Emulate instruction counter

Writing to IC is not allowed in the privileged mode.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 8f42ab27 05-Jun-2014 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

KVM: PPC: BOOK3S: PR: Emulate virtual timebase register

virtual time base register is a per VM, per cpu register that needs
to be saved and restored on vm exit and entry. Writing to VTB is not
allowed in the privileged mode.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[agraf: fix compile error]
Signed-off-by: Alexander Graf <agraf@suse.de>


# 2e23f544 29-Apr-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Book3S PR: Expose EBB registers

POWER8 introduces a new facility called the "Event Based Branch" facility.
It contains of a few registers that indicate where a guest should branch to
when a defined event occurs and it's in PR mode.

We don't want to really enable EBB as it will create a big mess with !PR guest
mode while hardware is in PR and we don't really emulate the PMU anyway.

So instead, let's just leave it at emulation of all its registers.

Signed-off-by: Alexander Graf <agraf@suse.de>


# e14e7a1e 21-Apr-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Book3S PR: Expose TAR facility to guest

POWER8 implements a new register called TAR. This register has to be
enabled in FSCR and then from KVM's point of view is mere storage.

This patch enables the guest to use TAR.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 616dff86 29-Apr-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR

POWER8 introduced a new interrupt type called "Facility unavailable interrupt"
which contains its status message in a new register called FSCR.

Handle these exits and try to emulate instructions for unhandled facilities.
Follow-on patches enable KVM to expose specific facilities into the guest.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 5deb8e7a 24-Apr-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Make shared struct aka magic page guest endian

The shared (magic) page is a data structure that contains often used
supervisor privileged SPRs accessible via memory to the user to reduce
the number of exits we have to take to read/write them.

When we actually share this structure with the guest we have to maintain
it in guest endianness, because some of the patch tricks only work with
native endian load/store operations.

Since we only share the structure with either host or guest in little
endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv.

For booke, the shared struct stays big endian. For book3s_64 hv we maintain
the struct in host native endian, since it never gets shared with the guest.

For book3s_64 pr we introduce a variable that tells us which endianness the
shared struct is in and route every access to it through helper inline
functions that evaluate this variable.

Signed-off-by: Alexander Graf <agraf@suse.de>


# ab78475c 06-Apr-2014 Alexander Graf <agraf@suse.de>

KVM: PPC: Book3S: ifdef on CONFIG_KVM_BOOK3S_32_HANDLER for 32bit

The book3s_32 target can get built as module which means we don't see the
config define for it in code. Instead, check on the bool define
CONFIG_KVM_BOOK3S_32_HANDLER whenever we want to know whether we're building
for a book3s_32 host.

This fixes running book3s_32 kvm as a module for me.

Signed-off-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>


# efff1912 15-Oct-2013 Paul Mackerras <paulus@samba.org>

KVM: PPC: Store FP/VSX/VMX state in thread_fp/vr_state structures

This uses struct thread_fp_state and struct thread_vr_state to store
the floating-point, VMX/Altivec and VSX state, rather than flat arrays.
This makes transferring the state to/from the thread_struct simpler
and allows us to unify the get/set_one_reg implementations for the
VSX registers.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 398a76c6 09-Dec-2013 Alexander Graf <agraf@suse.de>

KVM: PPC: Add devname:kvm aliases for modules

Systems that support automatic loading of kernel modules through
device aliases should try and automatically load kvm when /dev/kvm
gets opened.

Add code to support that magic for all PPC kvm targets, even the
ones that don't support modules yet.

Signed-off-by: Alexander Graf <agraf@suse.de>


# a78b55d1 07-Oct-2013 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

kvm: powerpc: book3s: drop is_hv_enabled

drop is_hv_enabled, because that should not be a callback property

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# cbbc58d4 07-Oct-2013 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

kvm: powerpc: book3s: Allow the HV and PR selection per virtual machine

This moves the kvmppc_ops callbacks to be a per VM entity. This
enables us to select HV and PR mode when creating a VM. We also
allow both kvm-hv and kvm-pr kernel module to be loaded. To
achieve this we move /dev/kvm ownership to kvm.ko module. Depending on
which KVM mode we select during VM creation we take a reference
count on respective module

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[agraf: fix coding style]
Signed-off-by: Alexander Graf <agraf@suse.de>


# 5587027c 07-Oct-2013 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

kvm: Add struct kvm arg to memslot APIs

We will use that in the later patch to find the kvm ops handler

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 2ba9f0d8 07-Oct-2013 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

kvm: powerpc: book3s: Support building HV and PR KVM as module

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[agraf: squash in compile fix]
Signed-off-by: Alexander Graf <agraf@suse.de>


# 699cc876 07-Oct-2013 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

kvm: powerpc: book3s: Add is_hv_enabled to kvmppc_ops

This help us to identify whether we are running with hypervisor mode KVM
enabled. The change is needed so that we can have both HV and PR kvm
enabled in the same kernel.

If both HV and PR KVM are included, interrupts come in to the HV version
of the kvmppc_interrupt code, which then jumps to the PR handler,
renamed to kvmppc_interrupt_pr, if the guest is a PR guest.

Allowing both PR and HV in the same kernel required some changes to
kvm_dev_ioctl_check_extension(), since the values returned now can't
be selected with #ifdefs as much as previously. We look at is_hv_enabled
to return the right value when checking for capabilities.For capabilities that
are only provided by HV KVM, we return the HV value only if
is_hv_enabled is true. For capabilities provided by PR KVM but not HV,
we return the PR value only if is_hv_enabled is false.

NOTE: in later patch we replace is_hv_enabled with a static inline
function comparing kvm_ppc_ops

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 3a167bea 07-Oct-2013 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

kvm: powerpc: Add kvmppc_ops callback

This patch add a new callback kvmppc_ops. This will help us in enabling
both HV and PR KVM together in the same kernel. The actual change to
enable them together is done in the later patch in the series.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[agraf: squash in booke changes]
Signed-off-by: Alexander Graf <agraf@suse.de>


# 93b159b4 19-Sep-2013 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S PR: Better handling of host-side read-only pages

Currently we request write access to all pages that get mapped into the
guest, even if the guest is only loading from the page. This reduces
the effectiveness of KSM because it means that we unshare every page we
access. Also, we always set the changed (C) bit in the guest HPTE if
it allows writing, even for a guest load.

This fixes both these problems. We pass an 'iswrite' flag to the
mmu.xlate() functions and to kvmppc_mmu_map_page() to indicate whether
the access is a load or a store. The mmu.xlate() functions now only
set C for stores. kvmppc_gfn_to_pfn() now calls gfn_to_pfn_prot()
instead of gfn_to_pfn() so that it can indicate whether we need write
access to the page, and get back a 'writable' flag to indicate whether
the page is writable or not. If that 'writable' flag is clear, we then
make the host HPTE read-only even if the guest HPTE allowed writing.

This means that we can get a protection fault when the guest writes to a
page that it has mapped read-write but which is read-only on the host
side (perhaps due to KSM having merged the page). Thus we now call
kvmppc_handle_pagefault() for protection faults as well as HPTE not found
faults. In kvmppc_handle_pagefault(), if the access was allowed by the
guest HPTE and we thus need to install a new host HPTE, we then need to
remove the old host HPTE if there is one. This is done with a new
function, kvmppc_mmu_unmap_page(), which uses kvmppc_mmu_pte_vflush() to
find and remove the old host HPTE.

Since the memslot-related functions require the KVM SRCU read lock to
be held, this adds srcu_read_lock/unlock pairs around the calls to
kvmppc_handle_pagefault().

Finally, this changes kvmppc_mmu_book3s_32_xlate_pte() to not ignore
guest HPTEs that don't permit access, and to return -EPERM for accesses
that are not permitted by the page protections.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# c0867fd5 05-Sep-2013 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S: Add GET/SET_ONE_REG interface for VRSAVE

The VRSAVE register value for a vcpu is accessible through the
GET/SET_SREGS interface for Book E processors, but not for Book 3S
processors. In order to make this accessible for Book 3S processors,
this adds a new register identifier for GET/SET_ONE_REG, and adds
the code to implement it.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 8b78645c 17-Apr-2013 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S: Facilities to save/restore XICS presentation ctrler state

This adds the ability for userspace to save and restore the state
of the XICS interrupt presentation controllers (ICPs) via the
KVM_GET/SET_ONE_REG interface. Since there is one ICP per vcpu, we
simply define a new 64-bit register in the ONE_REG space for the ICP
state. The state includes the CPU priority setting, the pending IPI
priority, and the priority and source number of any pending external
interrupt.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# bc5ad3f3 17-Apr-2013 Benjamin Herrenschmidt <benh@kernel.crashing.org>

KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

This adds in-kernel emulation of the XICS (eXternal Interrupt
Controller Specification) interrupt controller specified by PAPR, for
both HV and PR KVM guests.

The XICS emulation supports up to 1048560 interrupt sources.
Interrupt source numbers below 16 are reserved; 0 is used to mean no
interrupt and 2 is used for IPIs. Internally these are represented in
blocks of 1024, called ICS (interrupt controller source) entities, but
that is not visible to userspace.

Each vcpu gets one ICP (interrupt controller presentation) entity,
used to store the per-vcpu state such as vcpu priority, pending
interrupt state, IPI request, etc.

This does not include any API or any way to connect vcpus to their
ICP state; that will be added in later patches.

This is based on an initial implementation by Michael Ellerman
<michael@ellerman.id.au> reworked by Benjamin Herrenschmidt and
Paul Mackerras.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix typo, add dependency on !KVM_MPIC]
Signed-off-by: Alexander Graf <agraf@suse.de>


# 092d62ee 07-Apr-2013 Bharat Bhushan <r65777@freescale.com>

KVM: PPC: debug stub interface parameter defined

This patch defines the interface parameter for KVM_SET_GUEST_DEBUG
ioctl support. Follow up patches will use this for setting up
hardware breakpoints, watchpoints and software breakpoints.

Also kvm_arch_vcpu_ioctl_set_guest_debug() is brought one level below.
This is because I am not sure what is required for book3s. So this ioctl
behaviour will not change for book3s.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 8c32a2ea 20-Mar-2013 Bharat Bhushan <r65777@freescale.com>

Added ONE_REG interface for debug instruction

This patch adds the one_reg interface to get the special instruction
to be used for setting software breakpoint from userspace.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 4fe27d2a 14-Feb-2013 Paul Mackerras <paulus@samba.org>

KVM: PPC: Remove unused argument to kvmppc_core_dequeue_external

Currently kvmppc_core_dequeue_external() takes a struct kvm_interrupt *
argument and does nothing with it, in any of its implementations.
This removes it in order to make things easier for forthcoming
in-kernel interrupt controller emulation code.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# a8bd19ef 25-Sep-2012 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S: Get/set guest FP regs using the GET/SET_ONE_REG interface

This enables userspace to get and set all the guest floating-point
state using the KVM_[GS]ET_ONE_REG ioctls. The floating-point state
includes all of the traditional floating-point registers and the
FPSCR (floating point status/control register), all the VMX/Altivec
vector registers and the VSCR (vector status/control register), and
on POWER7, the vector-scalar registers (note that each FP register
is the high-order half of the corresponding VSR).

Most of these are implemented in common Book 3S code, except for VSX
on POWER7. Because HV and PR differ in how they store the FP and VSX
registers on POWER7, the code for these cases is not common. On POWER7,
the FP registers are the upper halves of the VSX registers vsr0 - vsr31.
PR KVM stores vsr0 - vsr31 in two halves, with the upper halves in the
arch.fpr[] array and the lower halves in the arch.vsr[] array, whereas
HV KVM on POWER7 stores the whole VSX register in arch.vsr[].

Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix whitespace, vsx compilation]
Signed-off-by: Alexander Graf <agraf@suse.de>


# a136a8bd 25-Sep-2012 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S: Get/set guest SPRs using the GET/SET_ONE_REG interface

This enables userspace to get and set various SPRs (special-purpose
registers) using the KVM_[GS]ET_ONE_REG ioctls. With this, userspace
can get and set all the SPRs that are part of the guest state, either
through the KVM_[GS]ET_REGS ioctls, the KVM_[GS]ET_SREGS ioctls, or
the KVM_[GS]ET_ONE_REG ioctls.

The SPRs that are added here are:

- DABR: Data address breakpoint register
- DSCR: Data stream control register
- PURR: Processor utilization of resources register
- SPURR: Scaled PURR
- DAR: Data address register
- DSISR: Data storage interrupt status register
- AMR: Authority mask register
- UAMOR: User authority mask override register
- MMCR0, MMCR1, MMCRA: Performance monitor unit control registers
- PMC1..PMC8: Performance monitor unit counter registers

In order to reduce code duplication between PR and HV KVM code, this
moves the kvm_vcpu_ioctl_[gs]et_one_reg functions into book3s.c and
centralizes the copying between user and kernel space there. The
registers that are handled differently between PR and HV, and those
that exist only in one flavor, are handled in kvmppc_[gs]et_one_reg()
functions that are specific to each flavor.

Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: minimal style fixes]
Signed-off-by: Alexander Graf <agraf@suse.de>


# f61c94bb 08-Aug-2012 Bharat Bhushan <r65777@freescale.com>

KVM: PPC: booke: Add watchdog emulation

This patch adds the watchdog emulation in KVM. The watchdog
emulation is enabled by KVM_ENABLE_CAP(KVM_CAP_PPC_BOOKE_WATCHDOG) ioctl.
The kernel timer are used for watchdog emulation and emulates
h/w watchdog state machine. On watchdog timer expiry, it exit to QEMU
if TCR.WRC is non ZERO. QEMU can reset/shutdown etc depending upon how
it is configured.

Signed-off-by: Liu Yu <yu.liu@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
[bharat.bhushan@freescale.com: reworked patch]
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
[agraf: adjust to new request framework]
Signed-off-by: Alexander Graf <agraf@suse.de>


# bbcc9c06 13-Mar-2012 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/kvm: Fix magic page vs. 32-bit RTAS on ppc64

When the kernel calls into RTAS, it switches to 32-bit mode. The
magic page was is longer accessible in that case, causing the
patched instructions in the RTAS call wrapper to crash.

This fixes it by making available a 32-bit mapping of the magic
page in that case. This mapping is flushed whenever we switch
the kernel back to 64-bit mode.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[agraf: add a check if the magic page is mapped]
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# a8e4ef84 16-Feb-2012 Alexander Graf <agraf@suse.de>

KVM: PPC: booke: rework rescheduling checks

Instead of checking whether we should reschedule only when we exited
due to an interrupt, let's always check before entering the guest back
again. This gets the target more in line with the other archs.

Also while at it, generalize the whole thing so that eventually we could
have a single kvmppc_prepare_to_enter function for all ppc targets that
does signal and reschedule checking for us.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 82ed3616 14-Dec-2011 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3s HV: Implement get_dirty_log using hardware changed bit

This changes the implementation of kvm_vm_ioctl_get_dirty_log() for
Book3s HV guests to use the hardware C (changed) bits in the guest
hashed page table. Since this makes the implementation quite different
from the Book3s PR case, this moves the existing implementation from
book3s.c to book3s_pr.c and creates a new implementation in book3s_hv.c.
That implementation calls kvmppc_hv_get_dirty_log() to do the actual
work by calling kvm_test_clear_dirty on each page. It iterates over
the HPTEs, clearing the C bit if set, and returns 1 if any C bit was
set (including the saved C bit in the rmap entry).

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# dfd4d47e 16-Nov-2011 Scott Wood <scottwood@freescale.com>

KVM: PPC: booke: Improve timer register emulation

Decrementers are now properly driven by TCR/TSR, and the guest
has full read/write access to these registers.

The decrementer keeps ticking (and setting the TSR bit) regardless of
whether the interrupts are enabled with TCR.

The decrementer stops at zero, rather than going negative.

Decrementers (and FITs, once implemented) are delivered as
level-triggered interrupts -- dequeued when the TSR bit is cleared, not
on delivery.

Signed-off-by: Liu Yu <yu.liu@freescale.com>
[scottwood@freescale.com: significant changes]
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# b5904972 08-Nov-2011 Scott Wood <scottwood@freescale.com>

KVM: PPC: Paravirtualize SPRG4-7, ESR, PIR, MASn

This allows additional registers to be accessed by the guest
in PR-mode KVM without trapping.

SPRG4-7 are readable from userspace. On booke, KVM will sync
these registers when it enters the guest, so that accesses from
guest userspace will work. The guest kernel, OTOH, must consistently
use either the real registers or the shared area between exits. This
also applies to the already-paravirted SPRG3.

On non-booke, it's not clear to what extent SPRG4-7 are supported
(they're not architected for book3s, but exist on at least some classic
chips). They are copied in the get/set regs ioctls, but I do not see any
non-booke emulation. I also do not see any syncing with real registers
(in PR-mode) including the user-readable SPRG3. This patch should not
make that situation any worse.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 7e28e60e 08-Nov-2011 Scott Wood <scottwood@freescale.com>

KVM: PPC: Rename deliver_interrupts to prepare_to_enter

This function also updates paravirt int_pending, so rename it
to be more obvious that this is a collection of checks run prior
to (re)entering a guest.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 28a37544 24-Nov-2011 Xiao Guangrong <xiaoguangrong.eric@gmail.com>

KVM: introduce id_to_memslot function

Introduce id_to_memslot to get memslot by slot id

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 66b15db6 27-May-2011 Paul Gortmaker <paul.gortmaker@windriver.com>

powerpc: add export.h to files making use of EXPORT_SYMBOL

With module.h being implicitly everywhere via device.h, the absence
of explicitly including something for EXPORT_SYMBOL went unnoticed.
Since we are heading to fix things up and clean module.h from the
device.h file, we need to explicitly include these files now.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# 3cf658b6 28-Jun-2011 Paul Mackerras <paulus@samba.org>

KVM: PPC: Deliver program interrupts right away instead of queueing them

Doing so means that we don't have to save the flags anywhere and gets
rid of the last reference to to_book3s(vcpu) in arch/powerpc/kvm/book3s.c.

Doing so is OK because a program interrupt won't be generated at the
same time as any other synchronous interrupt. If a program interrupt
and an asynchronous interrupt (external or decrementer) are generated
at the same time, the program interrupt will be delivered, which is
correct because it has a higher priority, and then the asynchronous
interrupt will be masked.

We don't ever generate system reset or machine check interrupts to the
guest, but if we did, then we would need to make sure they got delivered
rather than the program interrupt. The current code would be wrong in
this situation anyway since it would deliver the program interrupt as
well as the reset/machine check interrupt.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# f05ed4d5 28-Jun-2011 Paul Mackerras <paulus@samba.org>

KVM: PPC: Split out code from book3s.c into book3s_pr.c

In preparation for adding code to enable KVM to use hypervisor mode
on 64-bit Book 3S processors, this splits book3s.c into two files,
book3s.c and book3s_pr.c, where book3s_pr.c contains the code that is
specific to running the guest in problem state (user mode) and book3s.c
contains code which should apply to all Book 3S processors.

In doing this, we abstract some details, namely the interrupt offset,
updating the interrupt pending flag, and detecting if the guest is
in a critical section. These are all things that will be different
when we use hypervisor mode.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# c4befc58 28-Jun-2011 Paul Mackerras <paulus@samba.org>

KVM: PPC: Move fields between struct kvm_vcpu_arch and kvmppc_vcpu_book3s

This moves the slb field, which represents the state of the emulated
SLB, from the kvmppc_vcpu_book3s struct to the kvm_vcpu_arch, and the
hpte_hash_[v]pte[_long] fields from kvm_vcpu_arch to kvmppc_vcpu_book3s.
This is in accord with the principle that the kvm_vcpu_arch struct
represents the state of the emulated CPU, and the kvmppc_vcpu_book3s
struct holds the auxiliary data structures used in the emulation.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 149dbdb1 28-Jun-2011 Paul Mackerras <paulus@samba.org>

KVM: PPC: Fix machine checks on 32-bit Book3S

Commit 69acc0d3ba ("KVM: PPC: Resolve real-mode handlers through
function exports") resulted in vcpu->arch.trampoline_lowmem and
vcpu->arch.trampoline_enter ending up with kernel virtual addresses
rather than physical addresses. This is OK on 64-bit Book3S machines,
which ignore the top 4 bits of the effective address in real mode,
but on 32-bit Book3S machines, accessing these addresses in real mode
causes machine check interrupts, as the hardware uses the whole
effective address as the physical address in real mode.

This fixes the problem by using __pa() to convert these addresses
to physical addresses.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# a22a2dac 07-Jun-2011 Alexander Graf <agraf@suse.de>

KVM: PPC: Resolve real-mode handlers through function exports

Up until now, Book3S KVM had variables stored in the kernel that a kernel module
or the kvm code in the kernel could read from to figure out where some real mode
helper functions are located.

This is all unnecessary. The high bits of the EA get ignore in real mode, so we
can just use the pointer as is. Also, it's a lot easier on relocations when we
use the normal way of resolving the address to a function, instead of jumping
through hoops.

This patch fixes compilation with CONFIG_RELOCATABLE=y.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 44075d95 10-May-2011 Paul Mackerras <paulus@samba.org>

powerpc/kvm: Fix kvmppc_core_pending_dec

The vcpu->arch.pending_exceptions field is a bitfield indexed by
interrupt priority number as returned by kvmppc_book3s_vec2irqprio.
However, kvmppc_core_pending_dec was using an interrupt vector shifted
by 7 as the bit index. Fix it to use the irqprio value for the
decrementer interrupt instead. This problem was found by code
inspection.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# bc9c1933 29-Dec-2010 Peter Tyser <ptyser@xes-inc.com>

KVM: PPC: Fix SPRG get/set for Book3S and BookE

Previously SPRGs 4-7 were improperly read and written in
kvm_arch_vcpu_ioctl_get_regs() and kvm_arch_vcpu_ioctl_set_regs();

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Peter Tyser <ptyser@xes-inc.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 26535037 01-Nov-2010 Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>

KVM: replace vmalloc and memset with vzalloc

Let's use newly introduced vzalloc().

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 17bd1580 30-Aug-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Implement Level interrupts on Book3S

The current interrupt logic is just completely broken. We get a notification
from user space, telling us that an interrupt is there. But then user space
expects us that we just acknowledge an interrupt once we deliver it to the
guest.

This is not how real hardware works though. On real hardware, the interrupt
controller pulls the external interrupt line until it gets notified that the
interrupt was received.

So in reality we have two events: pulling and letting go of the interrupt line.

To maintain backwards compatibility, I added a new request for the pulling
part. The letting go part was implemented earlier already.

With this in place, we can now finally start guests that do not randomly stall
and stop to work at random times.

This patch implements above logic for Book3S.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 296c19d0 15-Aug-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Don't put MSR_POW in MSR

On Book3S a mtmsr with the MSR_POW bit set indicates that the OS is in
idle and only needs to be waked up on the next interrupt.

Now, unfortunately we let that bit slip into the stored MSR value which
is not what the real CPU does, so that we ended up executing code like
this:

r = mfmsr();
/* r containts MSR_POW */
mtmsr(r | MSR_EE);

This obviously breaks, as we're going into idle mode in code sections that
don't expect to be idling.

This patch masks MSR_POW out of the stored MSR value on wakeup, making
guests happy again.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 9ee18b1e 04-Aug-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Update int_pending also on dequeue

When having a decrementor interrupt pending, the dequeuing happens manually
through an mtdec instruction. This instruction simply calls dequeue on that
interrupt, so the int_pending hint doesn't get updated.

This patch enables updating the int_pending hint also on dequeue, thus
correctly enabling guests to stay in guest contexts more often.

Signed-off-by: Alexander Graf <agraf@suse.de>


# df1bfa25 02-Aug-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Put segment registers in shared page

Now that the actual mtsr doesn't do anything anymore, we can move the sr
contents over to the shared page, so a guest can directly read and write
its sr contents from guest context.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 8e865178 02-Aug-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Interpret SR registers on demand

Right now we're examining the contents of Book3s_32's segment registers when
the register is written and put the interpreted contents into a struct.

There are two reasons this is bad. For starters, the struct has worse real-time
performance, as it occupies more ram. But the more important part is that with
segment registers being interpreted from their raw values, we can put them in
the shared page, allowing guests to mess with them directly.

This patch makes the internal representation of SRs be u32s.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 2e602847 02-Aug-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Don't flush PTEs on NX/RO hit

When hitting a no-execute or read-only data/inst storage interrupt we were
flushing the respective PTE so we're sure it gets properly overwritten next.

According to the spec, this is unnecessary though. The guest issues a tlbie
anyways, so we're safe to just keep the PTE around and have it manually removed
from the guest, saving us a flush.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 4cb6b7ea 02-Aug-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Preload magic page when in kernel mode

When the guest jumps into kernel mode and has the magic page mapped, theres a
very high chance that it will also use it. So let's detect that scenario and
map the segment accordingly.

Signed-off-by: Alexander Graf <agraf@suse.de>


# bed1ed98 02-Aug-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Move EXIT_DEBUG partially to tracepoints

We have a debug printk on every exit that is usually #ifdef'ed out. Using
tracepoints makes a lot more sense here though, as they can be dynamically
enabled.

This patch converts the most commonly used debug printks of EXIT_DEBUG to
tracepoints.

Signed-off-by: Alexander Graf <agraf@suse.de>


# 646bab55 16-Aug-2010 Wei Yongjun <yjwei@cn.fujitsu.com>

KVM: PPC: fix leakage of error page in kvmppc_patch_dcbz()

Add kvm_release_page_clean() after is_error_page() to avoid
leakage of error page.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# e8508940 29-Jul-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Magic Page Book3s support

We need to override EA as well as PA lookups for the magic page. When the guest
tells us to project it, the magic page overrides any guest mappings.

In order to reflect that, we need to hook into all the MMU layers of KVM to
force map the magic page if necessary.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 28e83b4f 29-Jul-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Make PAM a define

On PowerPC it's very normal to not support all of the physical RAM in real mode.
To check if we're matching on the shared page or not, we need to know the limits
so we can restrain ourselves to that range.

So let's make it a define instead of open-coding it. And while at it, let's also
increase it.

Signed-off-by: Alexander Graf <agraf@suse.de>

v2 -> v3:

- RMO -> PAM (non-magic page)
Signed-off-by: Avi Kivity <avi@redhat.com>


# 90bba358 29-Jul-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Tell guest about pending interrupts

When the guest turns on interrupts again, it needs to know if we have an
interrupt pending for it. Because if so, it should rather get out of guest
context and get the interrupt.

So we introduce a new field in the shared page that we use to tell the guest
that there's a pending interrupt lying around.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 5c6cedf4 29-Jul-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Add PV guest critical sections

When running in hooked code we need a way to disable interrupts without
clobbering any interrupts or exiting out to the hypervisor.

To achieve this, we have an additional critical field in the shared page. If
that field is equal to the r1 register of the guest, it tells the hypervisor
that we're in such a critical section and thus may not receive any interrupts.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 2a342ed5 29-Jul-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Implement hypervisor interface

To communicate with KVM directly we need to plumb some sort of interface
between the guest and KVM. Usually those interfaces use hypercalls.

This hypercall implementation is described in the last patch of the series
in a special documentation file. Please read that for further information.

This patch implements stubs to handle KVM PPC hypercalls on the host and
guest side alike.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# a73a9599 29-Jul-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Convert SPRG[0-4] to shared page

When in kernel mode there are 4 additional registers available that are
simple data storage. Instead of exiting to the hypervisor to read and
write those, we can just share them with the guest using the page.

This patch converts all users of the current field to the shared page.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# de7906c3 29-Jul-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Convert SRR0 and SRR1 to shared page

The SRR0 and SRR1 registers contain cached values of the PC and MSR
respectively. They get written to by the hypervisor when an interrupt
occurs or directly by the kernel. They are also used to tell the rfi(d)
instruction where to jump to.

Because it only gets touched on defined events that, it's very simple to
share with the guest. Hypervisor and guest both have full r/w access.

This patch converts all users of the current field to the shared page.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 5e030186 29-Jul-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Convert DAR to shared page.

The DAR register contains the address a data page fault occured at. This
register behaves pretty much like a simple data storage register that gets
written to on data faults. There is no hypervisor interaction required on
read or write.

This patch converts all users of the current field to the shared page.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# d562de48 29-Jul-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Convert DSISR to shared page

The DSISR register contains information about a data page fault. It is fully
read/write from inside the guest context and we don't need to worry about
interacting based on writes of this register.

This patch converts all users of the current field to the shared page.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 666e7252 29-Jul-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Convert MSR to shared page

One of the most obvious registers to share with the guest directly is the
MSR. The MSR contains the "interrupts enabled" flag which the guest has to
toggle in critical sections.

So in order to bring the overhead of interrupt en- and disabling down, let's
put msr into the shared page. Keep in mind that even though you can fully read
its contents, writing to it doesn't always update all state. There are a few
safe fields that don't require hypervisor interaction. See the documentation
for a list of MSR bits that are safe to be set from inside the guest.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 96bc451a 29-Jul-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Introduce shared page

For transparent variable sharing between the hypervisor and guest, I introduce
a shared page. This shared page will contain all the registers the guest can
read and write safely without exiting guest context.

This patch only implements the stubs required for the basic structure of the
shared page. The actual register moving follows.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# fef093be 30-Jun-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Make use of hash based Shadow MMU

We just introduced generic functions to handle shadow pages on PPC.
This patch makes the respective backends make use of them, getting
rid of a lot of duplicate code along the way.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 49f6be8e 31-May-2010 Andreas Schwab <schwab@linux-m68k.org>

KVM: PPC: elide struct thread_struct instances from stack

Instead of instantiating a whole thread_struct on the stack use only the
required parts of it.

Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Tested-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 2122ff5e 13-May-2010 Avi Kivity <avi@redhat.com>

KVM: move vcpu locking to dispatcher for generic vcpu ioctls

All vcpu ioctls need to be locked, so instead of locking each one specifically
we lock at the generic dispatcher.

This patch only updates generic ioctls and leaves arch specific ioctls alone.

Signed-off-by: Avi Kivity <avi@redhat.com>


# 98001d8d 13-May-2010 Avi Kivity <avi@redhat.com>

KVM: PPC: Add missing vcpu_load()/vcpu_put() in vcpu ioctls

Signed-off-by: Avi Kivity <avi@redhat.com>


# 0ee75bea 28-Apr-2010 Avi Kivity <avi@redhat.com>

KVM: Let vcpu structure alignment be determined at runtime

vmx and svm vcpus have different contents and therefore may have different
alignmment requirements. Let each specify its required alignment.

Signed-off-by: Avi Kivity <avi@redhat.com>


# 329d20ba 26-Apr-2010 Stephen Rothwell <sfr@canb.auug.org.au>

KVM: powerpc: use of kzalloc/kfree requires including slab.h

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# b83d4a9c 19-Apr-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Enable native paired singles

When we're on a paired single capable host, we can just always enable
paired singles and expose them to the guest directly.

This approach breaks when multiple VMs run and access PS concurrently,
but this should suffice until we get a proper framework for it in Linux.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# f7bc74e1 19-Apr-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Improve split mode

When in split mode, instruction relocation and data relocation are not equal.

So far we implemented this mode by reserving a special pseudo-VSID for the
two cases and flushing all PTEs when going into split mode, which is slow.

Unfortunately 32bit Linux and Mac OS X use split mode extensively. So to not
slow down things too much, I came up with a different idea: Mark the split
mode with a bit in the VSID and then treat it like any other segment.

This means we can just flush the shadow segment cache, but keep the PTEs
intact. I verified that this works with ppc32 Linux and Mac OS X 10.4
guests and does speed them up.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 7fdaec99 19-Apr-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Make Performance Counters work

When we get a performance counter interrupt we need to route it on to the
Linux handler after we got out of the guest context. We also need to tell
our handling code that this particular interrupt doesn't need treatment.

So let's add those two bits in, making perf work while having a KVM guest
running.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# af7b4d10 19-Apr-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Convert u64 -> ulong

There are some pieces in the code that I overlooked that still use
u64s instead of longs. This slows down 32 bit hosts unnecessarily, so
let's just move them to ulong.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# ada7ba17 15-Apr-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Check max IRQ prio

We have a define on what the highest bit of IRQ priorities is. So we can
just as well use it in the bit checking code and avoid invalid IRQ values
to be triggered.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 07b0907d 15-Apr-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Add Book3S compatibility code

Some code we had so far required defines and had code that was completely
Book3S_64 specific. Since we now opened book3s.c to Book3S_32 too, we need
to take care of these pieces.

So let's add some minor code where it makes sense to not go the Book3S_64
code paths and add compat defines on others.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 61db97cc 15-Apr-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Emulate segment fault

Book3S_32 doesn't know about segment faults. It only knows about page faults.
So in order to know that we didn't map a segment, we need to fake segment
faults.

We do this by setting invalid segment registers to an invalid VSID and then
check for that VSID on normal page faults.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 9cc5e953 15-Apr-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Extract MMU init

The host shadow mmu code needs to get initialized. It needs to fetch a
segment it can use to put shadow PTEs into.

That initialization code was in generic code, which is icky. Let's move
it over to the respective MMU file.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# c7f38f46 15-Apr-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Improve indirect svcpu accessors

We already have some inline fuctions we use to access vcpu or svcpu structs,
depending on whether we're on booke or book3s. Since we just put a few more
registers into the svcpu, we also need to make sure the respective callbacks
are available and get used.

So this patch moves direct use of the now in the svcpu struct fields to
inline function calls. While at it, it also moves the definition of those
inline function calls to respective header files for booke and book3s,
greatly improving readability.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 05b0ab1c 24-Mar-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Disable MSR_FEx for Cell hosts

Cell can't handle MSR_FE0 and MSR_FE1 too well. It gets dog slow.
So let's just override the guest whenever we see one of the two and mask them
out. See commit ddf5f75a16b3e7460ffee881795aa168dffcd0cf for reference.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 9fb244a2 24-Mar-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Fix dcbz emulation

On most systems we need to emulate dcbz when running 32 bit guests. So
far we've been rather slack, not giving correct DSISR values to the guest.

This patch makes the emulation more accurate, introducing a difference
between "page not mapped" and "write protection fault". While at it, it
also speeds up dcbz emulation by an order of magnitude by using kmap.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# a2b07664 24-Mar-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Make build work without CONFIG_VSX/ALTIVEC

The FPU/Altivec/VSX enablement also brought access to some structure
elements that are only defined when the respective config options
are enabled.

Unfortuately I forgot to check for the config options at some places,
so let's do that now.

Unbreaks the build when CONFIG_VSX is not set.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# ad0a048b 24-Mar-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Add OSI hypercall interface

MOL uses its own hypercall interface to call back into userspace when
the guest wants to do something.

So let's implement that as an exit reason, specify it with a CAP and
only really use it when userspace wants us to.

The only user of it so far is MOL.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# ca7f4203 24-Mar-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Implement alignment interrupt

Mac OS X has some applications - namely the Finder - that require alignment
interrupts to work properly. So we need to implement them.

But the spec for 970 and 750 also looks different. While 750 requires the
DSISR and DAR fields to reflect some instruction bits (DSISR) and the fault
address (DAR), the 970 declares this as an optional feature. So we need
to reconstruct DSISR and DAR manually.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# a56cf347 24-Mar-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Load VCPU for register fetching

When trying to read or store vcpu register data, we should also make
sure the vcpu is actually loaded, so we're 100% sure we get the correct
values.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# c2453693 24-Mar-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Don't reload FPU with invalid values

When the guest activates the FPU, we load it up. That's fine when
it wasn't activated before on the host, but if it was we end up
reloading FPU values from last time the FPU was deactivated on the
host without writing the proper values back to the vcpu struct.

This patch checks if the FPU is enabled already and if so just doesn't
bother activating it, making FPU operations survive guest context switches.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 8963221d 24-Mar-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Split instruction reading out

The current check_ext function reads the instruction and then does
the checking. Let's split the reading out so we can reuse it for
different functions.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 18978768 24-Mar-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Allow userspace to unset the IRQ line

Userspace can tell us that it wants to trigger an interrupt. But
so far it can't tell us that it wants to stop triggering one.

So let's interpret the parameter to the ioctl that we have anyways
to tell us if we want to raise or lower the interrupt line.

Signed-off-by: Alexander Graf <agraf@suse.de>

v2 -> v3:

- Add CAP for unset irq
Signed-off-by: Avi Kivity <avi@redhat.com>


# 3eeafd7d 24-Mar-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Ensure split mode works

On PowerPC we can go into MMU Split Mode. That means that either
data relocation is on but instruction relocation is off or vice
versa.

That mode didn't work properly, as we weren't always flushing
entries when going into a new split mode, potentially mapping
different code or data that we're supposed to.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 7e821d39 22-Feb-2010 Alexander Graf <alex@csgraf.de>

KVM: PPC: Memset vcpu to zeros

While converting the kzalloc we used to allocate our vcpu struct to
vmalloc, I forgot to memset the contents to zeros. That broke quite
a lot.

This patch memsets it to zero again.

Signed-off-by: Alexander Graf <alex@csgraf.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 032c3407 18-Feb-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Allocate vcpu struct using vmalloc

We used to use get_free_pages to allocate our vcpu struct. Unfortunately
that call failed on me several times after my machine had a big enough
uptime, as memory became too fragmented by then.

Fortunately, we don't need it to be page aligned any more! We can just
vmalloc it and everything's great.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# e5c29e92 19-Feb-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Enable program interrupt to do MMIO

When we get a program interrupt we usually don't expect it to perform an
MMIO operation. But why not? When we emulate paired singles, we can end
up loading or storing to an MMIO address - and the handling of those
happens in the program interrupt handler.

So let's teach the program interrupt handler how to deal with EMULATE_MMIO.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# aba3bd7f 19-Feb-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Make ext giveup non-static

We need to call the ext giveup handlers from code outside of book3s.c.
So let's make it non-static.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 5467a97d 19-Feb-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Make software load/store return eaddr

The Book3S KVM implementation contains some helper functions to load and store
data from and to virtual addresses.

Unfortunately, this helper used to keep the physical address it so nicely
found out for us to itself. So let's change that and make it return the
physical address it resolved.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# d1bab74c 19-Feb-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Preload FPU when possible

There are some situations when we're pretty sure the guest will use the
FPU soon. So we can save the churn of going into the guest, finding out
it does want to use the FPU and going out again.

This patch adds preloading of the FPU when it's reasonable.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# c8c0b6f2 19-Feb-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Combine extension interrupt handlers

When we for example get an Altivec interrupt, but our guest doesn't support
altivec, we need to inject a program interrupt, not an altivec interrupt.

The same goes for paired singles. When an altivec interrupt arrives, we're
pretty sure we need to emulate the instruction because it's a paired single
operation.

So let's make all the ext handlers aware that they need to jump to the
program interrupt handler when an extension interrupt arrives that
was not supposed to arrive for the guest CPU.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 3c402a75 19-Feb-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Add hidden flag for paired singles

The Gekko implements an extension called paired singles. When the guest wants
to use that extension, we need to make sure we're not running the host FPU,
because all FPU instructions need to get emulated to accomodate for additional
operations that occur.

This patch adds an hflag to track if we're in paired single mode or not.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 37f5bca6 19-Feb-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Add AGAIN type for emulation return

Emulation of an instruction can have different outcomes. It can succeed,
fail, require MMIO, do funky BookE stuff - or it can just realize something's
odd and will be fixed the next time around.

Exactly that is what EMULATE_AGAIN means. Using that flag we can now tell
the caller that nothing happened, but we still want to go back to the
guest and see what happens next time we come around.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 87bf6e7d 12-Apr-2010 Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>

KVM: fix the handling of dirty bitmaps to avoid overflows

Int is not long enough to store the size of a dirty bitmap.

This patch fixes this problem with the introduction of a wrapper
function to calculate the sizes of dirty bitmaps.

Note: in mark_page_dirty(), we have to consider the fact that
__set_bit() takes the offset as int, not long.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# a76f8497 15-Jan-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Move Shadow MSR calculation to function

We keep a copy of the MSR around that we use when we go into the guest context.

That copy is basically the normal process MSR flags OR some allowed guest
specified MSR flags. We also AND the external providers into this, so we get
traps on FPU usage when we haven't activated it on the host yet.

Currently this calculation is part of the set_msr function that we use whenever
we set the guest MSR value. With the external providers, we also have the case
that we don't modify the guest's MSR, but only want to update the shadow MSR.

So let's move the shadow MSR parts to a separate function that we then use
whenever we only need to update it. That way we don't accidently kvm_vcpu_block
within a preempt notifier context.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# f7adbba1 15-Jan-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Keep SRR1 flags around in shadow_msr

SRR1 stores more information that just the MSR value. It also stores
valuable information about the type of interrupt we received, for
example whether the storage interrupt we just got was because of a
missing htab entry or not.

We use that information to speed up the exit path.

Now if we get preempted before we can interpret the shadow_msr values,
we get into vcpu_put which then calls the MSR handler, which then sets
all the SRR1 information bits in shadow_msr to 0. Great.

So let's preserve the SRR1 specific bits in shadow_msr whenever we set
the MSR. They don't hurt.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 180a34d2 15-Jan-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Add support for FPU/Altivec/VSX

When our guest starts using either the FPU, Altivec or VSX we need to make
sure Linux knows about it and sneak into its process switching code
accordingly.

This patch makes accesses to the above parts of the system work inside the
VM.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# ff1ca3f9 07-Jan-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Pass program interrupt flags to the guest

When we need to reinject a program interrupt into the guest, we also need to
reinject the corresponding flags into the guest.

Signed-off-by: Alexander Graf <agraf@suse.de>
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 25a8a02d 07-Jan-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Emulate trap SRR1 flags properly

Book3S needs some flags in SRR1 to get to know details about an interrupt.

One such example is the trap instruction. It tells the guest kernel that
a program interrupt is due to a trap using a bit in SRR1.

This patch implements above behavior, making WARN_ON behave like WARN_ON.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 021ec9c6 07-Jan-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Call SLB patching code in interrupt safe manner

Currently we're racy when doing the transition from IR=1 to IR=0, from
the module memory entry code to the real mode SLB switching code.

To work around that I took a look at the RTAS entry code which is faced
with a similar problem and did the same thing:

A small helper in linear mapped memory that does mtmsr with IR=0 and
then RFIs info the actual handler.

Thanks to that trick we can safely take page faults in the entry code
and only need to be really wary of what to do as of the SLB switching
part.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 7e57cba0 07-Jan-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Use PACA backed shadow vcpu

We're being horribly racy right now. All the entry and exit code hijacks
random fields from the PACA that could easily be used by different code in
case we get interrupted, for example by a #MC or even page fault.

After discussing this with Ben, we figured it's best to reserve some more
space in the PACA and just shove off some vcpu state to there.

That way we can drastically improve the readability of the code, make it
less racy and less complex.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 992b5b29 07-Jan-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Add helpers for CR, XER

We now have helpers for the GPRs, so let's also add some for CR and XER.

Having them in the PACA simplifies code a lot, as we don't need to care
about where to store CC or not to overflow any integers.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 8e5b26b5 07-Jan-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Use accessor functions for GPR access

All code in PPC KVM currently accesses gprs in the vcpu struct directly.

While there's nothing wrong with that wrt the current way gprs are stored
and loaded, it doesn't suffice for the PACA acceleration that will follow
in this patchset.

So let's just create little wrapper inline functions that we call whenever
a GPR needs to be read from or written to. The compiled code shouldn't really
change at all for now.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 97c4cfbe 04-Jan-2010 Alexander Graf <agraf@suse.de>

KVM: PPC: Enable lightweight exits again

The PowerPC C ABI defines that registers r14-r31 need to be preserved across
function calls. Since our exit handler is written in C, we can make use of that
and don't need to reload r14-r31 on every entry/exit cycle.

This technique is also used in the BookE code and is called "lightweight exits"
there. To follow the tradition, it's called the same in Book3S.

So far this optimization was disabled though, as the code didn't do what it was
expected to do, but failed to work.

This patch fixes and enables lightweight exits again.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 79fac95e 23-Dec-2009 Marcelo Tosatti <mtosatti@redhat.com>

KVM: convert slots_lock to a mutex

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 46a26bf5 23-Dec-2009 Marcelo Tosatti <mtosatti@redhat.com>

KVM: modify memslots layout in struct kvm

Have a pointer to an allocated region inside struct kvm.

[alex: fix ppc book 3s]

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 0bb1fb71 21-Dec-2009 Alexander Graf <agraf@suse.de>

KVM: powerpc: Remove AGGRESSIVE_DEC

Because we now emulate the DEC interrupt according to real life behavior,
there's no need to keep the AGGRESSIVE_DEC hack around.

Let's just remove it.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Acked-by: Hollis Blanchard <hollis@penguinppc.org>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 7706664d 21-Dec-2009 Alexander Graf <agraf@suse.de>

KVM: powerpc: Improve DEC handling

We treated the DEC interrupt like an edge based one. This is not true for
Book3s. The DEC keeps firing until mtdec is issued again and thus clears
the interrupt line.

So let's implement this logic in KVM too. This patch moves the line clearing
from the firing of the interrupt to the mtdec emulation.

This makes PPC64 guests work without AGGRESSIVE_DEC defined.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Acked-by: Hollis Blanchard <hollis@penguinppc.org>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 583617b7 21-Dec-2009 Alexander Graf <agraf@suse.de>

KVM: powerpc: Move vector to irqprio resolving to separate function

We're using a switch table to find the irqprio that belongs to a specific
interrupt vector. This table is part of the interrupt inject logic.

Since we'll add a new function to stop interrupts, let's move this table
out of the injection logic into a separate function.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Acked-by: Hollis Blanchard <hollis@penguinppc.org>
Signed-off-by: Avi Kivity <avi@redhat.com>


# e15a1137 29-Nov-2009 Alexander Graf <agraf@suse.de>

powerpc/kvm: Sync guest visible MMU state

Currently userspace has no chance to find out which virtual address space we're
in and resolve addresses. While that is a big problem for migration, it's also
unpleasent when debugging, as gdb and the monitor don't work on virtual
addresses.

This patch exports enough of the MMU segment state to userspace to make
debugging work and thus also includes the groundwork for migration.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 2f4cf5e4 29-Oct-2009 Alexander Graf <agraf@suse.de>

Add book3s.c

This adds the book3s core handling file. Here everything that is generic to
desktop PowerPC cores is handled, including interrupt injections, MSR settings,
etc.

It basically takes over the same role as booke.c for embedded PowerPCs.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>