#
c2ed087e |
|
20-Feb-2024 |
Madhavan Srinivasan <maddy@linux.ibm.com> |
powerpc: Add Power11 architected and raw mode Add CPU table entries for raw and architected mode. Most fields are copied from the Power10 table entries. CPU, MMU and user (ELF_HWCAP) features are unchanged vs P10. However userspace can detect P11 because the AT_PLATFORM value changes to "power11". The logical PVR value of 0x0F000007, passed to firmware via the ibm_arch_vec, indicates the kernel can support a P11 compatible CPU, which means at least ISA v3.1 compliant. Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240221044623.1598642-1-mpe@ellerman.id.au
|
#
20c8c4da |
|
06-Feb-2024 |
Amit Machhiwal <amachhiw@linux.ibm.com> |
KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat' Currently, rebooting a pseries nested qemu-kvm guest (L2) results in below error as L1 qemu sends PVR value 'arch_compat' == 0 via ppc_set_compat ioctl. This triggers a condition failure in kvmppc_set_arch_compat() resulting in an EINVAL. qemu-system-ppc64: Unable to set CPU compatibility mode in KVM: Invalid argument Also, a value of 0 for arch_compat generally refers the default compatibility of the host. But, arch_compat, being a Guest Wide Element in nested API v2, cannot be set to 0 in GSB as PowerVM (L0) expects a non-zero value. A value of 0 triggers a kernel trap during a reboot and consequently causes it to fail: [ 22.106360] reboot: Restarting system KVM: unknown exit, hardware reason ffffffffffffffea NIP 0000000000000100 LR 000000000000fe44 CTR 0000000000000000 XER 0000000020040092 CPU#0 MSR 0000000000001000 HID0 0000000000000000 HF 6c000000 iidx 3 didx 3 TB 00000000 00000000 DECR 0 GPR00 0000000000000000 0000000000000000 c000000002a8c300 000000007fe00000 GPR04 0000000000000000 0000000000000000 0000000000001002 8000000002803033 GPR08 000000000a000000 0000000000000000 0000000000000004 000000002fff0000 GPR12 0000000000000000 c000000002e10000 0000000105639200 0000000000000004 GPR16 0000000000000000 000000010563a090 0000000000000000 0000000000000000 GPR20 0000000105639e20 00000001056399c8 00007fffe54abab0 0000000105639288 GPR24 0000000000000000 0000000000000001 0000000000000001 0000000000000000 GPR28 0000000000000000 0000000000000000 c000000002b30840 0000000000000000 CR 00000000 [ - - - - - - - - ] RES 000@ffffffffffffffff SRR0 0000000000000000 SRR1 0000000000000000 PVR 0000000000800200 VRSAVE 0000000000000000 SPRG0 0000000000000000 SPRG1 0000000000000000 SPRG2 0000000000000000 SPRG3 0000000000000000 SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6 0000000000000000 SPRG7 0000000000000000 HSRR0 0000000000000000 HSRR1 0000000000000000 CFAR 0000000000000000 LPCR 0000000000020400 PTCR 0000000000000000 DAR 0000000000000000 DSISR 0000000000000000 kernel:trap=0xffffffea | pc=0x100 | msr=0x1000 This patch updates kvmppc_set_arch_compat() to use the host PVR value if 'compat_pvr' == 0 indicating that qemu doesn't want to enforce any specific PVR compat mode. The relevant part of the code might need a rework if PowerVM implements a support for `arch_compat == 0` in nestedv2 API. Fixes: 19d31c5f1157 ("KVM: PPC: Add support for nestedv2 guests") Reviewed-by: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240207054526.3720087-1-amachhiw@linux.ibm.com
|
#
eed52e43 |
|
27-Oct-2023 |
Sean Christopherson <seanjc@google.com> |
KVM: Allow arch code to track number of memslot address spaces per VM Let x86 track the number of address spaces on a per-VM basis so that KVM can disallow SMM memslots for confidential VMs. Confidentials VMs are fundamentally incompatible with emulating SMM, which as the name suggests requires being able to read and write guest memory and register state. Disallowing SMM will simplify support for guest private memory, as KVM will not need to worry about tracking memory attributes for multiple address spaces (SMM is the only "non-default" address space across all architectures). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
180c6b07 |
|
01-Dec-2023 |
Jordan Niethe <jniethe5@gmail.com> |
KVM: PPC: Book3S HV nestedv2: Do not cancel pending decrementer exception In the nestedv2 case, if there is a pending decrementer exception, the L1 must get the L2's timebase from the L0 to see if the exception should be cancelled. This adds the overhead of a H_GUEST_GET_STATE call to the likely case in which the decrementer should not be cancelled. Avoid this logic for the nestedv2 case. Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231201132618.555031-13-vaibhav@linux.ibm.com
|
#
db1dcfae |
|
01-Dec-2023 |
Jordan Niethe <jniethe5@gmail.com> |
KVM: PPC: Book3S HV nestedv2: Register the VPA with the L0 In the nestedv2 case, the L1 may register the L2's VPA with the L0. This allows the L0 to manage the L2's dispatch count, as well as enable possible performance optimisations by seeing if certain resources are not being used by the L2 (such as the PMCs). Use the H_GUEST_SET_STATE call to inform the L0 of the L2's VPA address. This can not be done in the H_GUEST_VCPU_RUN input buffer. Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231201132618.555031-11-vaibhav@linux.ibm.com
|
#
a9a3de53 |
|
01-Dec-2023 |
Jordan Niethe <jniethe5@gmail.com> |
KVM: PPC: Book3S HV nestedv2: Avoid msr check in kvmppc_handle_exit_hv() The msr check in kvmppc_handle_exit_hv() is not needed for nestedv2 hosts, skip the check to avoid a H_GUEST_GET_STATE hcall. Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231201132618.555031-9-vaibhav@linux.ibm.com
|
#
ecd10702 |
|
01-Dec-2023 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Handle pending exceptions on guest entry with MSR_EE Commit 026728dc5d41 ("KVM: PPC: Book3S HV P9: Inject pending xive interrupts at guest entry") changed guest entry so that if external interrupts are enabled, BOOK3S_IRQPRIO_EXTERNAL is not tested for. Test for this regardless of MSR_EE. For an L1 host, do not inject an interrupt, but always use LPCR_MER. If the L0 desires it can inject an interrupt. Fixes: 026728dc5d41 ("KVM: PPC: Book3S HV P9: Inject pending xive interrupts at guest entry") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [jpn: use kvmpcc_get_msr(), write commit message] Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231201132618.555031-7-vaibhav@linux.ibm.com
|
#
ec0f6639 |
|
01-Dec-2023 |
Jordan Niethe <jniethe5@gmail.com> |
KVM: PPC: Book3S HV nestedv2: Ensure LPCR_MER bit is passed to the L0 LPCR_MER is conditionally set during entry to a guest if there is a pending external interrupt. In the nestedv2 case, this change is not being communicated to the L0, which means it is not being set in the L2. Ensure the updated LPCR value is passed to the L0. Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231201132618.555031-6-vaibhav@linux.ibm.com
|
#
63ccae78 |
|
01-Dec-2023 |
Jordan Niethe <jniethe5@gmail.com> |
KVM: PPC: Book3S HV nestedv2: Do not check msr on hcalls The check for a hcall coming from userspace is done for KVM-PR. This is not supported for nestedv2 and the L0 will directly inject the necessary exception to the L2 if userspace performs a hcall. Avoid checking the MSR and thus avoid a H_GUEST_GET_STATE hcall in the L1. Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231201132618.555031-4-vaibhav@linux.ibm.com
|
#
7d370e18 |
|
01-Dec-2023 |
Jordan Niethe <jniethe5@gmail.com> |
KVM: PPC: Book3S HV nestedv2: Invalidate RPT before deleting a guest An L0 must invalidate the L2's RPT during H_GUEST_DELETE if this has not already been done. This is a slow operation that means H_GUEST_DELETE must return H_BUSY multiple times before completing. Invalidating the tables before deleting the guest so there is less work for the L0 to do. Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231201132618.555031-2-vaibhav@linux.ibm.com
|
#
19d31c5f |
|
13-Sep-2023 |
Jordan Niethe <jniethe5@gmail.com> |
KVM: PPC: Add support for nestedv2 guests A series of hcalls have been added to the PAPR which allow a regular guest partition to create and manage guest partitions of its own. KVM already had an interface that allowed this on powernv platforms. This existing interface will now be called "nestedv1". The newly added PAPR interface will be called "nestedv2". PHYP will support the nestedv2 interface. At this time the host side of the nestedv2 interface has not been implemented on powernv but there is no technical reason why it could not be added. The nestedv1 interface is still supported. Add support to KVM to utilize these hcalls to enable running nested guests as a pseries guest on PHYP. Overview of the new hcall usage: - L1 and L0 negotiate capabilities with H_GUEST_{G,S}ET_CAPABILITIES() - L1 requests the L0 create a L2 with H_GUEST_CREATE() and receives a handle to use in future hcalls - L1 requests the L0 create a L2 vCPU with H_GUEST_CREATE_VCPU() - L1 sets up the L2 using H_GUEST_SET and the H_GUEST_VCPU_RUN input buffer - L1 requests the L0 runs the L2 vCPU using H_GUEST_VCPU_RUN() - L2 returns to L1 with an exit reason and L1 reads the H_GUEST_VCPU_RUN output buffer populated by the L0 - L1 handles the exit using H_GET_STATE if necessary - L1 reruns L2 vCPU with H_GUEST_VCPU_RUN - L1 frees the L2 in the L0 with H_GUEST_DELETE() Support for the new API is determined by trying H_GUEST_GET_CAPABILITIES. On a successful return, use the nestedv2 interface. Use the vcpu register state setters for tracking modified guest state elements and copy the thread wide values into the H_GUEST_VCPU_RUN input buffer immediately before running a L2. The guest wide elements can not be added to the input buffer so send them with a separate H_GUEST_SET call if necessary. Make the vcpu register getter load the corresponding value from the real host with H_GUEST_GET. To avoid unnecessarily calling H_GUEST_GET, track which values have already been loaded between H_GUEST_VCPU_RUN calls. If an element is present in the H_GUEST_VCPU_RUN output buffer it also does not need to be loaded again. Tested-by: Sachin Sant <sachinp@linux.ibm.com> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Gautam Menghani <gautam@linux.ibm.com> Signed-off-by: Kautuk Consul <kconsul@linux.vnet.ibm.com> Signed-off-by: Amit Machhiwal <amachhiw@linux.vnet.ibm.com> Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230914030600.16993-11-jniethe5@gmail.com
|
#
6de2e837 |
|
13-Sep-2023 |
Jordan Niethe <jniethe5@gmail.com> |
KVM: PPC: Book3S HV: Introduce low level MSR accessor kvmppc_get_msr() and kvmppc_set_msr_fast() serve as accessors for the MSR. However because the MSR is kept in the shared regs they include a conditional check for kvmppc_shared_big_endian() and endian conversion. Within the Book3S HV specific code there are direct reads and writes of shregs::msr. In preparation for Nested APIv2 these accesses need to be replaced with accessor functions so it is possible to extend their behavior. However, using the kvmppc_get_msr() and kvmppc_set_msr_fast() functions is undesirable because it would introduce a conditional branch and endian conversion that is not currently present. kvmppc_set_msr_hv() already exists, it is used for the kvmppc_ops::set_msr callback. Introduce a low level accessor __kvmppc_{s,g}et_msr_hv() that simply gets and sets shregs::msr. This will be extend for Nested APIv2 support. Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230914030600.16993-8-jniethe5@gmail.com
|
#
ebc88ea7 |
|
13-Sep-2023 |
Jordan Niethe <jniethe5@gmail.com> |
KVM: PPC: Book3S HV: Use accessors for VCPU registers Introduce accessor generator macros for Book3S HV VCPU registers. Use the accessor functions to replace direct accesses to this registers. This will be important later for Nested APIv2 support which requires additional functionality for accessing and modifying VCPU state. Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230914030600.16993-7-jniethe5@gmail.com
|
#
c8ae9b3c |
|
13-Sep-2023 |
Jordan Niethe <jniethe5@gmail.com> |
KVM: PPC: Use accessors for VCORE registers Introduce accessor generator macros for VCORE registers. Use the accessor functions to replace direct accesses to this registers. This will be important later for Nested APIv2 support which requires additional functionality for accessing and modifying VCPU state. Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230914030600.16993-6-jniethe5@gmail.com
|
#
7028ac8d |
|
13-Sep-2023 |
Jordan Niethe <jniethe5@gmail.com> |
KVM: PPC: Use accessors for VCPU registers Introduce accessor generator macros for VCPU registers. Use the accessor functions to replace direct accesses to this registers. This will be important later for Nested APIv2 support which requires additional functionality for accessing and modifying VCPU state. Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230914030600.16993-5-jniethe5@gmail.com
|
#
0e85b7df |
|
13-Sep-2023 |
Jordan Niethe <jniethe5@gmail.com> |
KVM: PPC: Always use the GPR accessors Always use the GPR accessor functions. This will be important later for Nested APIv2 support which requires additional functionality for accessing and modifying VCPU state. Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230914030600.16993-2-jniethe5@gmail.com
|
#
67c48662e |
|
08-Feb-2023 |
Thomas Huth <thuth@redhat.com> |
KVM: PPC: Standardize on "int" return types in the powerpc KVM code Most functions that are related to kvm_arch_vm_ioctl() already use "int" as return type to pass error values back to the caller. Some outlier functions use "long" instead for no good reason (they do not really require long values here). Let's standardize on "int" here to avoid casting the values back and forth between the two types. Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Thomas Huth <thuth@redhat.com> Message-Id: <20230208140105.655814-2-thuth@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
a3800ef9 |
|
07-Mar-2023 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Enable prefixed instructions for HV KVM and disable for PR KVM Now that we can read prefixed instructions from a HV KVM guest and emulate prefixed load/store instructions to emulated MMIO locations, we can add HFSCR_PREFIXED into the set of bits that are set in the HFSCR for a HV KVM guest on POWER10, allowing the guest to use prefixed instructions. PR KVM has not yet been extended to handle prefixed instructions in all situations where we might need to emulate them, so prevent the guest from enabling prefixed instructions in the FSCR for now. Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/ZAgs25dCmLrVkBdU@cleo
|
#
953e3739 |
|
07-Mar-2023 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Fetch prefixed instructions from the guest In order to handle emulation of prefixed instructions in the guest, this first makes vcpu->arch.last_inst be an unsigned long, i.e. 64 bits on 64-bit platforms. For prefixed instructions, the upper 32 bits are used for the prefix and the lower 32 bits for the suffix, and both halves are byte-swapped if the guest endianness differs from the host. Next, vcpu->arch.emul_inst is now 64 bits wide, to match the HEIR register on POWER10. Like HEIR, for a prefixed instruction it is defined to have the prefix is in the top 32 bits and the suffix in the bottom 32 bits, with both halves in the correct byte order. kvmppc_get_last_inst is extended on 64-bit machines to put the prefix and suffix in the right places in the ppc_inst_t being returned. kvmppc_load_last_inst now returns the instruction in an unsigned long in the same format as vcpu->arch.last_inst. It makes the decision about whether to fetch a suffix based on the SRR1_PREFIXED bit in the MSR image stored in the vcpu struct, which generally comes from SRR1 or HSRR1 on an interrupt. This bit is defined in Power ISA v3.1B to be set if the interrupt occurred due to a prefixed instruction and cleared otherwise for all interrupts except for instruction storage interrupt, which does not come to the hypervisor. It is set to zero for asynchronous interrupts such as external interrupts. In previous ISA versions it was always set to 0 for all interrupts except instruction storage interrupt. The code in book3s_hv_rmhandlers.S that loads the faulting instruction on a HDSI is only used on POWER8 and therefore doesn't ever need to load a suffix. [npiggin@gmail.com - check that the is-prefixed bit in SRR1 matches the type of instruction that was fetched.] Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/ZAgsq9h1CCzouQuV@cleo
|
#
acf17878 |
|
07-Mar-2023 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Make kvmppc_get_last_inst() produce a ppc_inst_t This changes kvmppc_get_last_inst() so that the instruction it fetches is returned in a ppc_inst_t variable rather than a u32. This will allow us to return a 64-bit prefixed instruction on those 64-bit machines that implement Power ISA v3.1 or later, such as POWER10. On 32-bit platforms, ppc_inst_t is 32 bits wide, and is turned back into a u32 by ppc_inst_val, which is an identity operation on those platforms. Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/ZAgsiPlL9O7KnlZZ@cleo
|
#
6cd5c1db |
|
30-Mar-2023 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Set SRR1[PREFIX] bit on injected interrupts Pass the hypervisor (H)SRR1[PREFIX] indication through to synchronous interrupts injected into the guest. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230330103224.3589928-3-npiggin@gmail.com
|
#
460ba21d |
|
30-Mar-2023 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Permit SRR1 flags in more injected interrupt types The prefix architecture in ISA v3.1 introduces a prefixed bit in SRR1 for many types of synchronous interrupts which is set when the interrupt is caused by a prefixed instruction. This requires KVM to be able to set this bit when injecting interrupts into a guest. Plumb through the SRR1 "flags" argument to the core_queue APIs where it's missing for this. For now they are set to 0, which is no change. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Fixup kvmppc_core_queue_alignment() in booke.c] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230330103224.3589928-2-npiggin@gmail.com
|
#
4c8c3c7f |
|
07-Mar-2023 |
Valentin Schneider <vschneid@redhat.com> |
treewide: Trace IPIs sent via smp_send_reschedule() To be able to trace invocations of smp_send_reschedule(), rename the arch-specific definitions of it to arch_smp_send_reschedule() and wrap it into an smp_send_reschedule() that contains a tracepoint. Changes to include the declaration of the tracepoint were driven by the following coccinelle script: @func_use@ @@ smp_send_reschedule(...); @include@ @@ #include <trace/events/ipi.h> @no_include depends on func_use && !include@ @@ #include <...> + + #include <trace/events/ipi.h> [csky bits] [riscv bits] Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Guo Ren <guoren@kernel.org> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Link: https://lore.kernel.org/r/20230307143558.294354-6-vschneid@redhat.com
|
#
e4335f53 |
|
08-Sep-2022 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Implement scheduling wait interval counters in the VPA PAPR specifies accumulated virtual processor wait intervals that relate to partition scheduling interval times. Implement these counters in the same way as they are repoted by dtl. Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220908132545.4085849-5-npiggin@gmail.com
|
#
1a5486b3 |
|
08-Sep-2022 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Restore stolen time logging in dtl Stolen time logging in dtl was removed from the P9 path, so guests had no stolen time accounting. Add it back in a simpler way that still avoids locks and per-core accounting code. Fixes: ecb6a7207f92 ("KVM: PPC: Book3S HV P9: Remove most of the vcore logic") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220908132545.4085849-4-npiggin@gmail.com
|
#
b31bc24a |
|
08-Sep-2022 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Update guest state entry/exit accounting to new API Update the guest state and timing entry/exit accounting to use the new API, which was introduced following issues found[1]. KVM HV does possibly call instrumented code inside the guest context, and it does call srcu inside the guest context which is fragile at best. Switch to the new API, moving the guest context inside the srcu_read_lock/unlock region. [1] https://lore.kernel.org/lkml/20220201132926.3301912-1-mark.rutland@arm.com/ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220908132545.4085849-3-npiggin@gmail.com
|
#
c953f750 |
|
08-Sep-2022 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Fix irq disabling in tick accounting kvmhv_run_single_vcpu() disables PMIs as well as Linux irqs, however the tick time accounting code enables and disables irqs and not PMIs within this region. By chance this might not actually cause a bug, but it is clearly an incorrect use of the APIs. Fixes: 2251fbe76395e ("KVM: PPC: Book3S HV P9: Improve mtmsrd scheduling by delaying MSR[EE] disable") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220908132545.4085849-2-npiggin@gmail.com
|
#
bc91c04b |
|
08-Sep-2022 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Clear vcpu cpu fields before enabling host irqs On guest entry, vcpu->cpu and vcpu->arch.thread_cpu are set after disabling host irqs. On guest exit there is a window whre tick time accounting briefly enables irqs before these fields are cleared. Move them up to ensure they are cleared before host irqs are run. This is possibly not a problem, but is more symmetric and makes the fields less surprising. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220908132545.4085849-1-npiggin@gmail.com
|
#
0a5bfb82 |
|
16-Aug-2022 |
Fabiano Rosas <farosas@linux.ibm.com> |
KVM: PPC: Book3S HV: Fix decrementer migration We used to have a workaround[1] for a hang during migration that was made ineffective when we converted the decrementer expiry to be relative to guest timebase. The point of the workaround was that in the absence of an explicit decrementer expiry value provided by userspace during migration, KVM needs to initialize dec_expires to a value that will result in an expired decrementer after subtracting the current guest timebase. That stops the vcpu from hanging after migration due to a decrementer that's too large. If the dec_expires is now relative to guest timebase, its initialization needs to be guest timebase-relative as well, otherwise we end up with a decrementer expiry that is still larger than the guest timebase. 1- https://git.kernel.org/torvalds/c/5855564c8ab2 Fixes: 3c1a4322bba7 ("KVM: PPC: Book3S HV: Change dec_expires to be relative to guest timebase") Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220816222517.1916391-1-farosas@linux.ibm.com
|
#
2b461880 |
|
18-Jul-2022 |
Michael Ellerman <mpe@ellerman.id.au> |
powerpc: Fix all occurences of duplicate words Since commit 87c78b612f4f ("powerpc: Fix all occurences of "the the"") fixed "the the", there's now a steady stream of patches fixing other duplicate words. Just fix them all at once, to save the overhead of dealing with individual patches for each case. This leaves a few cases of "that that", which in some contexts is correct. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220718095158.326606-1-mpe@ellerman.id.au
|
#
b44bb1b7 |
|
25-May-2022 |
Fabiano Rosas <farosas@linux.ibm.com> |
KVM: PPC: Book3S HV: Provide more detailed timings for P9 entry path Alter the data collection points for the debug timing code in the P9 path to be more in line with what the code does. The points where we accumulate time are now the following: vcpu_entry: From vcpu_run_hv entry until the start of the inner loop; guest_entry: From the start of the inner loop until the guest entry in asm; in_guest: From the guest entry in asm until the return to KVM C code; guest_exit: From the return into KVM C code until the corresponding hypercall/page fault handling or re-entry into the guest; hypercall: Time spent handling hcalls in the kernel (hcalls can go to QEMU, not accounted here); page_fault: Time spent handling page faults; vcpu_exit: vcpu_run_hv exit (almost no code here currently). Like before, these are exposed in debugfs in a file called "timings". There are four values: - number of occurrences of the accumulation point; - total time the vcpu spent in the phase in ns; - shortest time the vcpu spent in the phase in ns; - longest time the vcpu spent in the phase in ns; === Before: rm_entry: 53132 16793518 256 4060 rm_intr: 53132 2125914 22 340 rm_exit: 53132 24108344 374 2180 guest: 53132 40980507996 404 9997650 cede: 0 0 0 0 After: vcpu_entry: 34637 7716108 178 4416 guest_entry: 52414 49365608 324 747542 in_guest: 52411 40828715840 258 9997480 guest_exit: 52410 19681717182 826 102496674 vcpu_exit: 34636 1744462 38 182 hypercall: 45712 22878288 38 1307962 page_fault: 992 111104034 568 168688 With just one instruction (hcall): vcpu_entry: 1 942 942 942 guest_entry: 1 4044 4044 4044 in_guest: 1 1540 1540 1540 guest_exit: 1 3542 3542 3542 vcpu_exit: 1 80 80 80 hypercall: 0 0 0 0 page_fault: 0 0 0 0 === Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220525130554.2614394-6-farosas@linux.ibm.com
|
#
c3fa64c9 |
|
25-May-2022 |
Fabiano Rosas <farosas@linux.ibm.com> |
KVM: PPC: Book3S HV: Decouple the debug timing from the P8 entry path We are currently doing the timing for debug purposes of the P9 entry path using the accumulators and terminology defined by the old entry path for P8 machines. Not only the "real-mode" and "napping" mentions are out of place for the P9 Radix entry path but also we cannot change them because the timing code is coupled to the structures defined in struct kvm_vcpu_arch. Add a new CONFIG_KVM_BOOK3S_HV_P9_TIMING to enable the timing code for the P9 entry path. For now, just add the new CONFIG and duplicate the structures. A subsequent patch will add the P9 changes. Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220525130554.2614394-4-farosas@linux.ibm.com
|
#
d349ab99 |
|
16-Jul-2022 |
Jason A. Donenfeld <Jason@zx2c4.com> |
random: handle archrandom with multiple longs The archrandom interface was originally designed for x86, which supplies RDRAND/RDSEED for receiving random words into registers, resulting in one function to generate an int and another to generate a long. However, other architectures don't follow this. On arm64, the SMCCC TRNG interface can return between one and three longs. On s390, the CPACF TRNG interface can return arbitrary amounts, with four longs having the same cost as one. On UML, the os_getrandom() interface can return arbitrary amounts. So change the api signature to take a "max_longs" parameter designating the maximum number of longs requested, and then return the number of longs generated. Since callers need to check this return value and loop anyway, each arch implementation does not bother implementing its own loop to try again to fill the maximum number of longs. Additionally, all existing callers pass in a constant max_longs parameter. Taken together, these two things mean that the codegen doesn't really change much for one-word-at-a-time platforms, while performance is greatly improved on platforms such as s390. Acked-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
#
ad55bae7 |
|
28-Mar-2022 |
Fabiano Rosas <farosas@linux.ibm.com> |
KVM: PPC: Book3S HV: Fix vcore_blocked tracepoint We removed most of the vcore logic from the P9 path but there's still a tracepoint that tried to dereference vc->runner. Fixes: ecb6a7207f92 ("KVM: PPC: Book3S HV P9: Remove most of the vcore logic") Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220328215831.320409-1-farosas@linux.ibm.com
|
#
cad32d9d |
|
05-May-2022 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
KVM: PPC: Book3s: Retire H_PUT_TCE/etc real mode handlers LoPAPR defines guest visible IOMMU with hypercalls to use it - H_PUT_TCE/etc. Implemented first on POWER7 where hypercalls would trap in the KVM in the real mode (with MMU off). The problem with the real mode is some memory is not available and some API usage crashed the host but enabling MMU was an expensive operation. The problems with the real mode handlers are: 1. Occasionally these cannot complete the request so the code is copied+modified to work in the virtual mode, very little is shared; 2. The real mode handlers have to be linked into vmlinux to work; 3. An exception in real mode immediately reboots the machine. If the small DMA window is used, the real mode handlers bring better performance. However since POWER8, there has always been a bigger DMA window which VMs use to map the entire VM memory to avoid calling H_PUT_TCE. Such 1:1 mapping happens once and uses H_PUT_TCE_INDIRECT (a bulk version of H_PUT_TCE) which virtual mode handler is even closer to its real mode version. On POWER9 hypercalls trap straight to the virtual mode so the real mode handlers never execute on POWER9 and later CPUs. So with the current use of the DMA windows and MMU improvements in POWER9 and later, there is no point in duplicating the code. The 32bit passed through devices may slow down but we do not have many of these in practice. For example, with this applied, a 1Gbit ethernet adapter still demostrates above 800Mbit/s of actual throughput. This removes the real mode handlers from KVM and related code from the powernv platform. This updates the list of implemented hcalls in KVM-HV as the realmode handlers are removed. This changes ABI - kvmppc_h_get_tce() moves to the KVM module and kvmppc_find_table() is static now. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220506053755.3820702-1-aik@ozlabs.ru
|
#
1d1cd0f1 |
|
25-Apr-2022 |
Fabiano Rosas <farosas@linux.ibm.com> |
KVM: PPC: Book3S HV: Initialize AMOR in nested entry The hypervisor always sets AMOR to ~0, but let's ensure we're not passing stale values around. Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220425142151.1495142-1-farosas@linux.ibm.com
|
#
2852ebfa |
|
02-Mar-2022 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV Nested: L2 LPCR should inherit L1 LPES setting The L1 should not be able to adjust LPES mode for the L2. Setting LPES if the L0 needs it clear would cause external interrupts to be sent to L2 and missed by the L0. Clearing LPES when it may be set, as typically happens with XIVE enabled could cause a performance issue despite having no native XIVE support in the guest, because it will cause mediated interrupts for the L2 to be taken in HV mode, which then have to be injected. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220303053315.1056880-7-npiggin@gmail.com
|
#
11681b79 |
|
02-Mar-2022 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV Nested: L2 must not run with L1 xive context The PowerNV L0 currently pushes the OS xive context when running a vCPU, regardless of whether it is running a nested guest. The problem is that xive OS ring interrupts will be delivered while the L2 is running. At the moment, by default, the L2 guest runs with LPCR[LPES]=0, which actually makes external interrupts go to the L0. That causes the L2 to exit and the interrupt taken or injected into the L1, so in some respects this behaves like an escalation. It's not clear if this was deliberate or not, there's no comment about it and the L1 is actually allowed to clear LPES in the L2, so it's confusing at best. When the L2 is running, the L1 is essentially in a ceded state with respect to external interrupts (it can't respond to them directly and won't get scheduled again absent some additional event). So the natural way to solve this is when the L0 handles a H_ENTER_NESTED hypercall to run the L2, have it arm the escalation interrupt and don't push the L1 context while running the L2. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220303053315.1056880-6-npiggin@gmail.com
|
#
42b4a2b3 |
|
02-Mar-2022 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Split !nested case out from guest entry The differences between nested and !nested will become larger in later changes so split them out for readability. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220303053315.1056880-5-npiggin@gmail.com
|
#
ad5ace91 |
|
02-Mar-2022 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Move cede logic out of XIVE escalation rearming Move the cede abort logic out of xive escalation rearming and into the caller to prepare for handling a similar case with nested guest entry. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220303053315.1056880-4-npiggin@gmail.com
|
#
026728dc |
|
02-Mar-2022 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Inject pending xive interrupts at guest entry If there is a pending xive interrupt, inject it at guest entry (if MSR[EE] is enabled) rather than take another interrupt when the guest is entered. If xive is enabled then LPCR[LPES] is set so this behaviour should be expected. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220303053315.1056880-3-npiggin@gmail.com
|
#
86160461 |
|
22-Jan-2022 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: HFSCR[PREFIX] does not exist This facility is controlled by FSCR only. Reserved bits should not be set in the HFSCR register (although it's likely harmless as this position would not be re-used, and the L0 is forgiving here too). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220122105639.3477407-1-npiggin@gmail.com
|
#
e6f6390a |
|
08-Mar-2022 |
Christophe Leroy <christophe.leroy@csgroup.eu> |
powerpc: Add missing headers Don't inherit headers "by chances" from asm/prom.h, asm/mpc52xx.h, asm/pci.h etc... Include the needed headers, and remove asm/prom.h when it was needed exclusively for pulling necessary headers. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/be8bdc934d152a7d8ee8d1a840d5596e2f7d85e0.1646767214.git.christophe.leroy@csgroup.eu
|
#
c7fa848f |
|
02-Mar-2022 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Fix "lost kick" race When new work is created that requires attention from the hypervisor (e.g., to inject an interrupt into the guest), fast_vcpu_kick is used to pull the target vcpu out of the guest if it may have been running. Therefore the work creation side looks like this: vcpu->arch.doorbell_request = 1; kvmppc_fast_vcpu_kick_hv(vcpu) { smp_mb(); cpu = vcpu->cpu; if (cpu != -1) send_ipi(cpu); } And the guest entry side *should* look like this: vcpu->cpu = smp_processor_id(); smp_mb(); if (vcpu->arch.doorbell_request) { // do something (abort entry or inject doorbell etc) } But currently the store and load are flipped, so it is possible for the entry to see no doorbell pending, and the doorbell creation misses the store to set cpu, resulting lost work (or at least delayed until the next guest exit). Fix this by reordering the entry operations and adding a smp_mb between them. The P8 path appears to have a similar race which is commented but not addressed yet. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220303053315.1056880-2-npiggin@gmail.com
|
#
175be7e5 |
|
24-Jan-2022 |
Fabiano Rosas <farosas@linux.ibm.com> |
KVM: PPC: Book3S HV: Free allocated memory if module init fails The module's exit function is not called when the init fails, we need to do cleanup before returning. Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220125155735.1018683-4-farosas@linux.ibm.com
|
#
c5d0d77b |
|
24-Jan-2022 |
Fabiano Rosas <farosas@linux.ibm.com> |
KVM: PPC: Book3S HV: Delay setting of kvm ops Delay the setting of kvm_hv_ops until after all init code has completed. This avoids leaving the ops still accessible if the init fails. Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220125155735.1018683-3-farosas@linux.ibm.com
|
#
69ab6ac3 |
|
24-Jan-2022 |
Fabiano Rosas <farosas@linux.ibm.com> |
KVM: PPC: Book3S HV: Check return value of kvmppc_radix_init The return of the function is being shadowed by the call to kvmppc_uvmem_init. Fixes: ca9f4942670c ("KVM: PPC: Book3S HV: Support for running secure guests") Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220125155735.1018683-2-farosas@linux.ibm.com
|
#
faf01aef |
|
10-Jan-2022 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
KVM: PPC: Merge powerpc's debugfs entry content into generic entry At the moment KVM on PPC creates 4 types of entries under the kvm debugfs: 1) "%pid-%fd" per a KVM instance (for all platforms); 2) "vm%pid" (for PPC Book3s HV KVM); 3) "vm%u_vcpu%u_timing" (for PPC Book3e KVM); 4) "kvm-xive-%p" (for XIVE PPC Book3s KVM, the same for XICS); The problem with this is that multiple VMs per process is not allowed for 2) and 3) which makes it possible for userspace to trigger errors when creating duplicated debugfs entries. This merges all these into 1). This defines kvm_arch_create_kvm_debugfs() similar to kvm_arch_create_vcpu_debugfs(). This defines 2 hooks in kvmppc_ops that allow specific KVM implementations add necessary entries, this adds the _e500 suffix to kvmppc_create_vcpu_debugfs_e500() to make it clear what platform it is for. This makes use of already existing kvm_arch_create_vcpu_debugfs() on PPC. This removes no more used debugfs_dir pointers from PPC kvm_arch structs. This stops removing vcpu entries as once created vcpus stay around for the entire life of a VM and removed when the KVM instance is closed, see commit d56f5136b010 ("KVM: let kvm_destroy_vm_debugfs clean up vCPU debugfs directories"). Suggested-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220111005404.162219-1-aik@ozlabs.ru
|
#
22f7ff0d |
|
22-Jan-2022 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV Nested: Fix nested HFSCR being clobbered with multiple vCPUs The L0 is storing HFSCR requested by the L1 for the L2 in struct kvm_nested_guest when the L1 requests a vCPU enter L2. kvm_nested_guest is not a per-vCPU structure. Hilarity ensues. Fix it by moving the nested hfscr into the vCPU structure together with the other per-vCPU nested fields. Fixes: 8b210a880b35 ("KVM: PPC: Book3S HV Nested: Make nested HFSCR state accessible") Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220122105530.3477250-1-npiggin@gmail.com
|
#
a54d8066 |
|
06-Dec-2021 |
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> |
KVM: Keep memslots in tree-based structures instead of array-based ones The current memslot code uses a (reverse gfn-ordered) memslot array for keeping track of them. Because the memslot array that is currently in use cannot be modified every memslot management operation (create, delete, move, change flags) has to make a copy of the whole array so it has a scratch copy to work on. Strictly speaking, however, it is only necessary to make copy of the memslot that is being modified, copying all the memslots currently present is just a limitation of the array-based memslot implementation. Two memslot sets, however, are still needed so the VM continues to run on the currently active set while the requested operation is being performed on the second, currently inactive one. In order to have two memslot sets, but only one copy of actual memslots it is necessary to split out the memslot data from the memslot sets. The memslots themselves should be also kept independent of each other so they can be individually added or deleted. These two memslot sets should normally point to the same set of memslots. They can, however, be desynchronized when performing a memslot management operation by replacing the memslot to be modified by its copy. After the operation is complete, both memslot sets once again point to the same, common set of memslot data. This commit implements the aforementioned idea. For tracking of gfns an ordinary rbtree is used since memslots cannot overlap in the guest address space and so this data structure is sufficient for ensuring that lookups are done quickly. The "last used slot" mini-caches (both per-slot set one and per-vCPU one), that keep track of the last found-by-gfn memslot, are still present in the new code. Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
|
#
eaaaed13 |
|
06-Dec-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: PPC: Avoid referencing userspace memory region in memslot updates For PPC HV, get the number of pages directly from the new memslot instead of computing the same from the userspace memory region, and explicitly check for !DELETE instead of inferring the same when toggling mmio_update. The motivation for these changes is to avoid referencing the @mem param so that it can be dropped in a future commit. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <1e97fb5198be25f98ef82e63a8d770c682264cc9.1638817639.git.maciej.szmigiero@oracle.com>
|
#
537a17b3 |
|
06-Dec-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: Let/force architectures to deal with arch specific memslot data Pass the "old" slot to kvm_arch_prepare_memory_region() and force arch code to handle propagating arch specific data from "new" to "old" when necessary. This is a baby step towards dynamically allocating "new" from the get go, and is a (very) minor performance boost on x86 due to not unnecessarily copying arch data. For PPC HV, copy the rmap in the !CREATE and !DELETE paths, i.e. for MOVE and FLAGS_ONLY. This is functionally a nop as the previous behavior would overwrite the pointer for CREATE, and eventually discard/ignore it for DELETE. For x86, copy the arch data only for FLAGS_ONLY changes. Unlike PPC HV, x86 needs to reallocate arch data in the MOVE case as the size of x86's allocations depend on the alignment of the memslot's gfn. Opportunistically tweak kvm_arch_prepare_memory_region()'s param order to match the "commit" prototype. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> [mss: add missing RISCV kvm_arch_prepare_memory_region() change] Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <67dea5f11bbcfd71e3da5986f11e87f5dd4013f9.1638817639.git.maciej.szmigiero@oracle.com>
|
#
46808a4c |
|
16-Nov-2021 |
Marc Zyngier <maz@kernel.org> |
KVM: Use 'unsigned long' as kvm_for_each_vcpu()'s index Everywhere we use kvm_for_each_vpcu(), we use an int as the vcpu index. Unfortunately, we're about to move rework the iterator, which requires this to be upgrade to an unsigned long. Let's bite the bullet and repaint all of it in one go. Signed-off-by: Marc Zyngier <maz@kernel.org> Message-Id: <20211116160403.4074052-7-maz@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
63fa47ba |
|
13-Dec-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: PPC: Book3S HV P9: Use kvm_arch_vcpu_get_wait() to get rcuwait object Use kvm_arch_vcpu_get_wait() to get a vCPU's rcuwait object instead of using vcpu->wait directly in kvmhv_run_single_vcpu(). Functionally, this is a nop as vcpu->arch.waitp is guaranteed to point at vcpu->wait. But that is not obvious at first glance, and a future change coming in via the KVM tree, commit 510958e99721 ("KVM: Force PPC to define its own rcuwait object"), will hide vcpu->wait from architectures that define __KVM_HAVE_ARCH_WQP to prevent generic KVM from attepting to wake a vCPU with the wrong rcuwait object. Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211213174556.3871157-1-seanjc@google.com
|
#
511d25d6 |
|
01-Sep-2021 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
KVM: PPC: Book3S: Suppress warnings when allocating too big memory slots The userspace can trigger "vmalloc size %lu allocation failure: exceeds total pages" via the KVM_SET_USER_MEMORY_REGION ioctl. This silences the warning by checking the limit before calling vzalloc() and returns ENOMEM if failed. This does not call underlying valloc helpers as __vmalloc_node() is only exported when CONFIG_TEST_VMALLOC_MODULE and __vmalloc_node_range() is not exported at all. Spotted by syzkaller. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [mpe: Use 'size' for the variable rather than 'cb'] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210901084512.1658628-1-aik@ozlabs.ru
|
#
9c5a432a |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Remove subcore HMI handling On POWER9 and newer, rather than the complex HMI synchronisation and subcore state, have each thread un-apply the guest TB offset before calling into the early HMI handler. This allows the subcore state to be avoided, including subcore enter / exit guest, which includes an expensive divide that shows up slightly in profiles. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-54-npiggin@gmail.com
|
#
6398326b |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Stop using vc->dpdes The P9 path uses vc->dpdes only for msgsndp / SMT emulation. This adds an ordering requirement between vcpu->doorbell_request and vc->dpdes for no real benefit. Use vcpu->doorbell_request directly. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-53-npiggin@gmail.com
|
#
617326ff |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Tidy kvmppc_create_dtl_entry This goes further to removing vcores from the P9 path. Also avoid the memset in favour of explicitly initialising all fields. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-52-npiggin@gmail.com
|
#
ecb6a720 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Remove most of the vcore logic The P9 path always uses one vcpu per vcore, so none of the vcore, locks, stolen time, blocking logic, shared waitq, etc., is required. Remove most of it. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-51-npiggin@gmail.com
|
#
434398ab |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Avoid cpu_in_guest atomics on entry and exit cpu_in_guest is set to determine if a CPU needs to be IPI'ed to exit the guest and notice the need_tlb_flush bit. This can be implemented as a global per-CPU pointer to the currently running guest instead of per-guest cpumasks, saving 2 atomics per entry/exit. P7/8 doesn't require cpu_in_guest, nor does a nested HV (only the L0 does), so move it to the P9 HV path. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-50-npiggin@gmail.com
|
#
4c9a6891 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Add unlikely annotation for !mmu_ready The mmu will almost always be ready. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-49-npiggin@gmail.com
|
#
b49c65c5 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Improve mfmsr performance on entry Rearrange the MSR saving on entry so it does not follow the mtmsrd to disable interrupts, avoiding a possible RAW scoreboard stall. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-46-npiggin@gmail.com
|
#
46dea77f |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV Nested: Avoid extra mftb() in nested entry mftb() is expensive and one can be avoided on nested guest dispatch. If the time checking code distinguishes between the L0 timer and the nested HV timer, then both can be tested in the same place with the same mftb() value. This also nicely illustrates the relationship between the L0 and nested HV timers. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-45-npiggin@gmail.com
|
#
d5c0e833 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Avoid tlbsync sequence on radix guest exit Use the existing TLB flushing logic to IPI the previous CPU and run the necessary barriers before running a guest vCPU on a new physical CPU, to do the necessary radix GTSE barriers for handling the case of an interrupted guest tlbie sequence. This requires the vCPU TLB flush sequence that is currently just done on one thread, to be expanded to ensure the other threads execute a ptesync, because causing them to exit the guest will no longer cause a ptesync by itself. This results in more IPIs than the TLB flush logic requires, but it's a significant win for common case scheduling when the vCPU remains on the same physical CPU. This saves about 520 cycles (nearly 10%) on a guest entry+exit micro benchmark on a POWER9. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-44-npiggin@gmail.com
|
#
a089a686 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Don't restore PSSCR if not needed This also moves the PSSCR update in nested entry to avoid a SPR scoreboard stall. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-42-npiggin@gmail.com
|
#
5236756d |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Use Linux SPR save/restore to manage some host SPRs Linux implements SPR save/restore including storage space for registers in the task struct for process context switching. Make use of this similarly to the way we make use of the context switching fp/vec save restore. This improves code reuse, allows some stack space to be saved, and helps with avoiding VRSAVE updates if they are not required. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-39-npiggin@gmail.com
|
#
022ecb96 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Demand fault TM facility registers Use HFSCR facility disabling to implement demand faulting for TM, with a hysteresis counter similar to the load_fp etc counters in context switching that implement the equivalent demand faulting for userspace facilities. This speeds up guest entry/exit by avoiding the register save/restore when a guest is not frequently using them. When a guest does use them often, there will be some additional demand fault overhead, but these are not commonly used facilities. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-38-npiggin@gmail.com
|
#
a3e18ca8 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Demand fault EBB facility registers Use HFSCR facility disabling to implement demand faulting for EBB, with a hysteresis counter similar to the load_fp etc counters in context switching that implement the equivalent demand faulting for userspace facilities. This speeds up guest entry/exit by avoiding the register save/restore when a guest is not frequently using them. When a guest does use them often, there will be some additional demand fault overhead, but these are not commonly used facilities. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-37-npiggin@gmail.com
|
#
d55b1ecc |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Restrict DSISR canary workaround to processors that require it Use CPU_FTR_P9_RADIX_PREFETCH_BUG to apply the workaround, to test for DD2.1 and below processors. This saves a mtSPR in guest entry. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-35-npiggin@gmail.com
|
#
3e7b3379 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Switch PMU to guest as late as possible This moves PMU switch to guest as late as possible in entry, and switch back to host as early as possible at exit. This helps the host get the most perf coverage of KVM entry/exit code as possible. This is slightly suboptimal for SPR scheduling point of view when the PMU is enabled, but when perf is disabled there is no real difference. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-34-npiggin@gmail.com
|
#
d5f48019 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Move remaining SPR and MSR access into low level entry Move register saving and loading from kvmhv_p9_guest_entry() into the HV and nested entry handlers. Accesses are scheduled to reduce mtSPR / mfSPR interleaving which reduces SPR scoreboard stalls. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-32-npiggin@gmail.com
|
#
08b3f08a |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Move nested guest entry into its own function Move the part of the guest entry which is specific to nested HV into its own function. This is just refactoring. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-31-npiggin@gmail.com
|
#
aabcaf6a |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Move host OS save/restore functions to built-in Move the P9 guest/host register switching functions to the built-in P9 entry code, and export it for nested to use as well. This allows more flexibility in scheduling these supervisor privileged SPR accesses with the HV privileged and PR SPR accesses in the low level entry code. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-30-npiggin@gmail.com
|
#
516b3342 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Move vcpu register save/restore into functions This should be no functional difference but makes the caller easier to read. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-29-npiggin@gmail.com
|
#
0f3b6c48 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Juggle SPR switching around This juggles SPR switching on the entry and exit sides to be more symmetric, which makes the next refactoring patch possible with no functional change. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-28-npiggin@gmail.com
|
#
9dfe7aa7 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Only execute mtSPR if the value changed Keep better track of the current SPR value in places where they are to be loaded with a new context, to reduce expensive mtSPR operations. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-27-npiggin@gmail.com
|
#
9a1e530b |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Avoid SPR scoreboard stalls Avoid interleaving mfSPR and mtSPR to reduce SPR scoreboard stalls. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-26-npiggin@gmail.com
|
#
cb2553a0 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Optimise timebase reads Reduce the number of mfTB executed by passing the current timebase around entry and exit code rather than read it multiple times. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-25-npiggin@gmail.com
|
#
3c1a4322 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Change dec_expires to be relative to guest timebase Change dec_expires to be relative to the guest timebase, and allow it to be moved into low level P9 guest entry functions, to improve SPR access scheduling. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-23-npiggin@gmail.com
|
#
cf99dedb |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Add kvmppc_stop_thread to match kvmppc_start_thread Small cleanup makes it a bit easier to match up entry and exit operations. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-22-npiggin@gmail.com
|
#
2251fbe7 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Improve mtmsrd scheduling by delaying MSR[EE] disable Moving the mtmsrd after the host SPRs are saved and before the guest SPRs start to be loaded can prevent an SPR scoreboard stall (because the mtmsrd is L=1 type which does not cause context synchronisation. This is also now more convenient to combined with the mtmsrd L=0 instruction to enable facilities just below, but that is not done yet. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-21-npiggin@gmail.com
|
#
34e119c9 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Reduce mtmsrd instructions required to save host SPRs This reduces the number of mtmsrd required to enable facility bits when saving/restoring registers, by having the KVM code set all bits up front rather than using individual facility functions that set their particular MSR bits. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-20-npiggin@gmail.com
|
#
174a3ab6 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Move SPRG restore to restore_p9_host_os_sprs Move the SPR update into its relevant helper function. This will help with SPR scheduling improvements in later changes. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-19-npiggin@gmail.com
|
#
a1a19e11 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: CTRL SPR does not require read-modify-write Processors that support KVM HV do not require read-modify-write of the CTRL SPR to set/clear their thread's runlatch. Just write 1 or 0 to it. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-18-npiggin@gmail.com
|
#
b1adcf57 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Factor out yield_count increment Factor duplicated code into a helper function. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-17-npiggin@gmail.com
|
#
9d3ddb86 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Demand fault PMU SPRs when marked not inuse The pmcregs_in_use field in the guest VPA can not be trusted to reflect what the guest is doing with PMU SPRs, so the PMU must always be managed (stopped) when exiting the guest, and SPR values set when entering the guest to ensure it can't cause a covert channel or otherwise cause other guests or the host to misbehave. So prevent guest access to the PMU with HFSCR[PM] if pmcregs_in_use is clear, and avoid the PMU SPR access on every partition switch. Guests that set pmcregs_in_use incorrectly or when first setting it and using the PMU will take a hypervisor facility unavailable interrupt that will bring in the PMU SPRs. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Athira Jajeev <atrajeev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-16-npiggin@gmail.com
|
#
401e1ae3 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Factor PMU save/load into context switch functions Rather than guest/host save/retsore functions, implement context switch functions that take care of details like the VPA update for nested. The reason to split these kind of helpers into explicit save/load functions is mainly to schedule SPR access nicely, but PMU is a special case where the load requires mtSPR (to stop counters) and other difficulties, so there's less possibility to schedule those nicely. The SPR accesses also have side-effects if the PMU is running, and in later changes we keep the host PMU running as long as possible so this code can be better profiled, which also complicates scheduling. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Athira Jajeev <atrajeev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-15-npiggin@gmail.com
|
#
57dc0eed |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Implement PMU save/restore in C Implement the P9 path PMU save/restore code in C, and remove the POWER9/10 code from the P7/8 path assembly. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Athira Jajeev <atrajeev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-14-npiggin@gmail.com
|
#
245ebf8e |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
powerpc/64s: Always set PMU control registers to frozen/disabled when not in use KVM PMU management code looks for particular frozen/disabled bits in the PMU registers so it knows whether it must clear them when coming out of a guest or not. Setting this up helps KVM make these optimisations without getting confused. Longer term the better approach might be to move guest/host PMU switching to the perf subsystem. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Athira Jajeev <atrajeev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-12-npiggin@gmail.com
|
#
d3c8a2d3 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Don't always save PMU for guest capable of nesting Provide a config option that controls the workaround added by commit 63279eeb7f93 ("KVM: PPC: Book3S HV: Always save guest pmu for guest capable of nesting"). The option defaults to y for now, but is expected to go away within a few releases. Nested capable guests running with the earlier commit 178266389794 ("KVM: PPC: Book3S HV Nested: Reflect guest PMU in-use to L0 when guest SPRs are live") will now indicate the PMU in-use status of their guests, which means the parent does not need to unconditionally save the PMU for nested capable guests. After this latest round of performance optimisations, this option costs about 540 cycles or 10% entry/exit performance on a POWER9 nested-capable guest. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> References: 178266389794 ("KVM: PPC: Book3S HV Nested: Reflect guest PMU in-use to L0 when guest SPRs are live") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-11-npiggin@gmail.com
|
#
eacc8188 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: POWER10 enable HAIL when running radix guests HV interrupts may be taken with the MMU enabled when radix guests are running. Enable LPCR[HAIL] on ISA v3.1 processors for radix guests. Make this depend on the host LPCR[HAIL] being enabled. Currently that is always enabled, but having this test means any issue that might require LPCR[HAIL] to be disabled in the host will not have to be duplicated in KVM. This optimisation takes 1380 cycles off a NULL hcall entry+exit micro benchmark on a POWER10. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-9-npiggin@gmail.com
|
#
25aa1458 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
powerpc/time: add API for KVM to re-arm the host timer/decrementer Rather than have KVM look up the host timer and fiddle with the irq-work internal details, have the powerpc/time.c code provide a function for KVM to re-arm the Linux timer code when exiting a guest. This is implementation has an improvement over existing code of marking a decrementer interrupt as soft-pending if a timer has expired, rather than setting DEC to a -ve value, which tended to cause host timers to take two interrupts (first hdec to exit the guest, then the immediate dec). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-8-npiggin@gmail.com
|
#
34bf08a2 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Reduce mftb per guest entry/exit mftb is serialising (dispatch next-to-complete) so it is heavy weight for a mfspr. Avoid reading it multiple times in the entry or exit paths. A small number of cycles delay to timers is tolerable. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-7-npiggin@gmail.com
|
#
4ebbd075 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Use host timer accounting to avoid decrementer read There is no need to save away the host DEC value, as it is derived from the host timer subsystem which maintains the next timer time, so it can be restored from there. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-5-npiggin@gmail.com
|
#
5955c746 |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KMV: PPC: Book3S HV P9: Use set_dec to set decrementer to host The host Linux timer code arms the decrementer with the value 'decrementers_next_tb - current_tb' using set_dec(), which stores val - 1 on Book3S-64, which is not quite the same as what KVM does to re-arm the host decrementer when exiting the guest. This shouldn't be a significant change, but it makes the logic match and avoids this small extra change being brought into the next patch. Suggested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-4-npiggin@gmail.com
|
#
736df58f |
|
23-Nov-2021 |
Nicholas Piggin <npiggin@gmail.com> |
powerpc/64s: guard optional TIDR SPR with CPU ftr test The TIDR SPR only exists on POWER9. Avoid accessing it when the feature bit for it is not set. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-3-npiggin@gmail.com
|
#
235cee16 |
|
27-Oct-2021 |
Laurent Vivier <lvivier@redhat.com> |
KVM: PPC: Tick accounting should defer vtime accounting 'til after IRQ handling Commit 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest context before enabling irqs") moved guest_exit() into the interrupt protected area to avoid wrong context warning (or worse). The problem is that tick-based time accounting has not yet been updated at this point (because it depends on the timer interrupt firing), so the guest time gets incorrectly accounted to system time. To fix the problem, follow the x86 fix in commit 160457140187 ("Defer vtime accounting 'til after IRQ handling"), and allow host IRQs to run before accounting the guest exit time. In the case vtime accounting is enabled, this is not required because TB is used directly for accounting. Before this patch, with CONFIG_TICK_CPU_ACCOUNTING=y in the host and a guest running a kernel compile, the 'guest' fields of /proc/stat are stuck at zero. With the patch they can be observed increasing roughly as expected. Fixes: e233d54d4d97 ("KVM: booke: use __kvm_guest_exit") Fixes: 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest context before enabling irqs") Cc: stable@vger.kernel.org # 5.12+ Signed-off-by: Laurent Vivier <lvivier@redhat.com> [np: only required for tick accounting, add Book3E fix, tweak changelog] Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211027142150.3711582-1-npiggin@gmail.com
|
#
8ccba534 |
|
02-Aug-2021 |
Jing Zhang <jingzhangos@google.com> |
KVM: stats: Add halt polling related histogram stats Add three log histogram stats to record the distribution of time spent on successful polling, failed polling and VCPU wait. halt_poll_success_hist: Distribution of spent time for a successful poll. halt_poll_fail_hist: Distribution of spent time for a failed poll. halt_wait_hist: Distribution of time a VCPU has spent on waiting. Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210802165633.1866976-6-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
87bcc5fa |
|
02-Aug-2021 |
Jing Zhang <jingzhangos@google.com> |
KVM: stats: Add halt_wait_ns stats for all architectures Add simple stats halt_wait_ns to record the time a VCPU has spent on waiting for all architectures (not just powerpc). Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210802165633.1866976-5-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
0c8fb653 |
|
11-Aug-2021 |
Nicholas Piggin <npiggin@gmail.com> |
powerpc/64s: Remove WORT SPR from POWER9/10 This register is not architected and not implemented in POWER9 or 10, it just reads back zeroes for compatibility. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Link: https://lore.kernel.org/r/20210811160134.904987-11-npiggin@gmail.com
|
#
17826638 |
|
11-Aug-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV Nested: Reflect guest PMU in-use to L0 when guest SPRs are live After the L1 saves its PMU SPRs but before loading the L2's PMU SPRs, switch the pmcregs_in_use field in the L1 lppaca to the value advertised by the L2 in its VPA. On the way out of the L2, set it back after saving the L2 PMU registers (if they were in-use). This transfers the PMU liveness indication between the L1 and L2 at the points where the registers are not live. This fixes the nested HV bug for which a workaround was added to the L0 HV by commit 63279eeb7f93a ("KVM: PPC: Book3S HV: Always save guest pmu for guest capable of nesting"), which explains the problem in detail. That workaround is no longer required for guests that include this bug fix. Fixes: 360cae313702 ("KVM: PPC: Book3S HV: Nested guest entry via hypercall") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Link: https://lore.kernel.org/r/20210811160134.904987-10-npiggin@gmail.com
|
#
7c3ded57 |
|
11-Aug-2021 |
Fabiano Rosas <farosas@linux.ibm.com> |
KVM: PPC: Book3S HV Nested: Stop forwarding all HFUs to L1 If the nested hypervisor has no access to a facility because it has been disabled by the host, it should also not be able to see the Hypervisor Facility Unavailable that arises from one of its guests trying to access the facility. This patch turns a HFU that happened in L2 into a Hypervisor Emulation Assistance interrupt and forwards it to L1 for handling. The ones that happened because L1 explicitly disabled the facility for L2 are still let through, along with the corresponding Cause bits in the HFSCR. Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> [np: move handling into kvmppc_handle_nested_exit] Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210811160134.904987-8-npiggin@gmail.com
|
#
8b210a88 |
|
11-Aug-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV Nested: Make nested HFSCR state accessible When the L0 runs a nested L2, there are several permutations of HFSCR that can be relevant. The HFSCR that the L1 vcpu L1 requested, the HFSCR that the L1 vcpu may use, and the HFSCR that is actually being used to run the L2. The L1 requested HFSCR is not accessible outside the nested hcall handler, so copy that into a new kvm_nested_guest.hfscr field. The permitted HFSCR is taken from the HFSCR that the L1 runs with, which is also not accessible while the hcall is being made. Move this into a new kvm_vcpu_arch.hfscr_permitted field. These will be used by the next patch to improve facility handling for nested guests, and later by facility demand faulting patches. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210811160134.904987-7-npiggin@gmail.com
|
#
d82b392d |
|
11-Aug-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV Nested: Fix TM softpatch HFAC interrupt emulation Have the TM softpatch emulation code set up the HFAC interrupt and return -1 in case an instruction was executed with HFSCR bits clear, and have the interrupt exit handler fall through to the HFAC handler. When the L0 is running a nested guest, this ensures the HFAC interrupt is correctly passed up to the L1. The "direct guest" exit handler will turn these into PROGILL program interrupts so functionality in practice will be unchanged. But it's possible an L1 would want to handle these in a different way. Also rearrange the FAC interrupt emulation code to match the HFAC format while here (mainly, adding the FSCR_INTR_CAUSE mask). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210811160134.904987-5-npiggin@gmail.com
|
#
fd42b7b0 |
|
11-Aug-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Initialise vcpu MSR with MSR_ME It is possible to create a VCPU without setting the MSR before running it, which results in a warning in kvmhv_vcpu_entry_p9() that MSR_ME is not set. This is pretty harmless because the MSR_ME bit is added to HSRR1 before HRFID to guest, and a normal qemu guest doesn't hit it. Initialise the vcpu MSR with MSR_ME set. Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210811160134.904987-2-npiggin@gmail.com
|
#
1753081f |
|
01-Jul-2021 |
Cédric Le Goater <clg@kaod.org> |
KVM: PPC: Book3S HV: XICS: Fix mapping of passthrough interrupts PCI MSIs now live in an MSI domain but the underlying calls, which will EOI the interrupt in real mode, need an HW IRQ number mapped in the XICS IRQ domain. Grab it there. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210701132750.1475580-31-clg@kaod.org
|
#
e5e78b15 |
|
01-Jul-2021 |
Cédric Le Goater <clg@kaod.org> |
KVM: PPC: Book3S HV: XIVE: Change interface of passthrough interrupt routines The routine kvmppc_set_passthru_irq() calls kvmppc_xive_set_mapped() and kvmppc_xive_clr_mapped() with an IRQ descriptor. Use directly the host IRQ number to remove a useless conversion. Add some debug. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210701132750.1475580-15-clg@kaod.org
|
#
ba418a02 |
|
01-Jul-2021 |
Cédric Le Goater <clg@kaod.org> |
KVM: PPC: Book3S HV: Use the new IRQ chip to detect passthrough interrupts Passthrough PCI MSI interrupts are detected in KVM with a check on a specific EOI handler (P8) or on XIVE (P9). We can now check the PCI-MSI IRQ chip which is cleaner. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210701132750.1475580-14-clg@kaod.org
|
#
2ac78e0c |
|
05-Aug-2021 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
KVM: PPC: Use arch_get_random_seed_long instead of powernv variant The powernv_get_random_long() does not work in nested KVM (which is pseries) and produces a crash when accessing in_be64(rng->regs) in powernv_get_random_long(). This replaces powernv_get_random_long with the ppc_md machine hook wrapper. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210805075649.2086567-1-aik@ozlabs.ru
|
#
bd31ecf4 |
|
15-Jul-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S: Fix CONFIG_TRANSACTIONAL_MEM=n crash When running CPU_FTR_P9_TM_HV_ASSIST, HFSCR[TM] is set for the guest even if the host has CONFIG_TRANSACTIONAL_MEM=n, which causes it to be unprepared to handle guest exits while transactional. Normal guests don't have a problem because the HTM capability will not be advertised, but a rogue or buggy one could crash the host. Fixes: 4bb3c7a0208f ("KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9") Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210716024310.164448-1-npiggin@gmail.com
|
#
59dc5bfc |
|
17-Jun-2021 |
Nicholas Piggin <npiggin@gmail.com> |
powerpc/64s: avoid reloading (H)SRR registers if they are still valid When an interrupt is taken, the SRR registers are set to return to where it left off. Unless they are modified in the meantime, or the return address or MSR are modified, there is no need to reload these registers when returning from interrupt. Introduce per-CPU flags that track the validity of SRR and HSRR registers. These are cleared when returning from interrupt, when using the registers for something else (e.g., OPAL calls), when adjusting the return address or MSR of a context, and when context switching (which changes the return address and MSR). This improves the performance of interrupt returns. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Fold in fixup patch from Nick] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210617155116.2167984-5-npiggin@gmail.com
|
#
900c83f8 |
|
28-Jun-2021 |
Liam Howlett <liam.howlett@oracle.com> |
arch/powerpc/kvm/book3s: use vma_lookup() in kvmppc_hv_setup_htab_rma() Using vma_lookup() removes the requirement to check if the address is within the returned vma. The code is easier to understand and more compact. Link: https://lkml.kernel.org/r/20210521174745.2219620-7-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0193cc90 |
|
18-Jun-2021 |
Jing Zhang <jingzhangos@google.com> |
KVM: stats: Separate generic stats from architecture specific ones Generic KVM stats are those collected in architecture independent code or those supported by all architectures; put all generic statistics in a separate structure. This ensures that they are defined the same way in the statistics API which is being added, removing duplication among different architectures in the declaration of the descriptors. No functional change intended. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-2-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
53324b51 |
|
21-Jun-2021 |
Bharata B Rao <bharata@linux.ibm.com> |
KVM: PPC: Book3S HV: Nested support in H_RPT_INVALIDATE Enable support for process-scoped invalidations from nested guests and partition-scoped invalidations for nested guests. Process-scoped invalidations for any level of nested guests are handled by implementing H_RPT_INVALIDATE handler in the nested guest exit path in L0. Partition-scoped invalidation requests are forwarded to the right nested guest, handled there and passed down to L0 for eventual handling. Signed-off-by: Bharata B Rao <bharata@linux.ibm.com> [aneesh: Nested guest partition-scoped invalidation changes] Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [mpe: Squash in fixup patch] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210621085003.904767-5-bharata@linux.ibm.com
|
#
f0c6fbbb |
|
21-Jun-2021 |
Bharata B Rao <bharata@linux.ibm.com> |
KVM: PPC: Book3S HV: Add support for H_RPT_INVALIDATE H_RPT_INVALIDATE does two types of TLB invalidations: 1. Process-scoped invalidations for guests when LPCR[GTSE]=0. This is currently not used in KVM as GTSE is not usually disabled in KVM. 2. Partition-scoped invalidations that an L1 hypervisor does on behalf of an L2 guest. This is currently handled by H_TLB_INVALIDATE hcall and this new replaces the old that. This commit enables process-scoped invalidations for L1 guests. Support for process-scoped and partition-scoped invalidations from/for nested guests will be added separately. Process scoped tlbie invalidations from L1 and nested guests need RS register for TLBIE instruction to contain both PID and LPID. This patch introduces primitives that execute tlbie instruction with both PID and LPID set in prepartion for H_RPT_INVALIDATE hcall. A description of H_RPT_INVALIDATE follows: int64 /* H_Success: Return code on successful completion */ /* H_Busy - repeat the call with the same */ /* H_Parameter, H_P2, H_P3, H_P4, H_P5 : Invalid parameters */ hcall(const uint64 H_RPT_INVALIDATE, /* Invalidate RPT translation lookaside information */ uint64 id, /* PID/LPID to invalidate */ uint64 target, /* Invalidation target */ uint64 type, /* Type of lookaside information */ uint64 pg_sizes, /* Page sizes */ uint64 start, /* Start of Effective Address (EA) range (inclusive) */ uint64 end) /* End of EA range (exclusive) */ Invalidation targets (target) ----------------------------- Core MMU 0x01 /* All virtual processors in the partition */ Core local MMU 0x02 /* Current virtual processor */ Nest MMU 0x04 /* All nest/accelerator agents in use by the partition */ A combination of the above can be specified, except core and core local. Type of translation to invalidate (type) --------------------------------------- NESTED 0x0001 /* invalidate nested guest partition-scope */ TLB 0x0002 /* Invalidate TLB */ PWC 0x0004 /* Invalidate Page Walk Cache */ PRT 0x0008 /* Invalidate caching of Process Table Entries if NESTED is clear */ PAT 0x0008 /* Invalidate caching of Partition Table Entries if NESTED is set */ A combination of the above can be specified. Page size mask (pages) ---------------------- 4K 0x01 64K 0x02 2M 0x04 1G 0x08 All sizes (-1UL) A combination of the above can be specified. All page sizes can be selected with -1. Semantics: Invalidate radix tree lookaside information matching the parameters given. * Return H_P2, H_P3 or H_P4 if target, type, or pageSizes parameters are different from the defined values. * Return H_PARAMETER if NESTED is set and pid is not a valid nested LPID allocated to this partition * Return H_P5 if (start, end) doesn't form a valid range. Start and end should be a valid Quadrant address and end > start. * Return H_NotSupported if the partition is not in running in radix translation mode. * May invalidate more translation information than requested. * If start = 0 and end = -1, set the range to cover all valid addresses. Else start and end should be aligned to 4kB (lower 11 bits clear). * If NESTED is clear, then invalidate process scoped lookaside information. Else pid specifies a nested LPID, and the invalidation is performed on nested guest partition table and nested guest partition scope real addresses. * If pid = 0 and NESTED is clear, then valid addresses are quadrant 3 and quadrant 0 spaces, Else valid addresses are quadrant 0. * Pages which are fully covered by the range are to be invalidated. Those which are partially covered are considered outside invalidation range, which allows a caller to optimally invalidate ranges that may contain mixed page sizes. * Return H_SUCCESS on success. Signed-off-by: Bharata B Rao <bharata@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210621085003.904767-4-bharata@linux.ibm.com
|
#
77bbbc0c |
|
01-Jun-2021 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Fix TLB management on SMT8 POWER9 and POWER10 processors The POWER9 vCPU TLB management code assumes all threads in a core share a TLB, and that TLBIEL execued by one thread will invalidate TLBs for all threads. This is not the case for SMT8 capable POWER9 and POWER10 (big core) processors, where the TLB is split between groups of threads. This results in TLB multi-hits, random data corruption, etc. Fix this by introducing cpu_first_tlb_thread_sibling etc., to determine which siblings share TLBs, and use that in the guest TLB flushing code. [npiggin@gmail.com: add changelog and comment] Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210602040441.3984352-1-npiggin@gmail.com
|
#
fae5c9f3 |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: remove ISA v3.0 and v3.1 support from P7/8 path POWER9 and later processors always go via the P9 guest entry path now. Remove the remaining support from the P7/8 path. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-33-npiggin@gmail.com
|
#
0bf7e1b2 |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: implement hash host / hash guest support Implement support for hash guests under hash host. This has to save and restore the host SLB, and ensure that the MMU is off while switching into the guest SLB. POWER9 and later CPUs now always go via the P9 path. The "fast" guest mode is now renamed to the P9 mode, which is consistent with its functionality and the rest of the naming. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-32-npiggin@gmail.com
|
#
079a09a5 |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: implement hash guest support Implement hash guest support. Guest entry/exit has to restore and save/clear the SLB, plus several other bits to accommodate hash guests in the P9 path. Radix host, hash guest support is removed from the P7/8 path. The HPT hcalls and faults are not handled in real mode, which is a performance regression. A worst-case fork/exit microbenchmark takes 3x longer after this patch. kbuild benchmark performance is in the noise, but the slowdown is likely to be noticed somewhere. For now, accept this penalty for the benefit of simplifying the P7/8 paths and unifying P9 hash with the new code, because hash is a less important configuration than radix on processors that support it. Hash will benefit from future optimisations to this path, including possibly a faster path to handle such hcalls and interrupts without doing a full exit. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-31-npiggin@gmail.com
|
#
ac3c8b41 |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Reflect userspace hcalls to hash guests to support PR KVM The reflection of sc 1 interrupts from guest PR=1 to the guest kernel is required to support a hash guest running PR KVM where its guest is making hcalls with sc 1. In preparation for hash guest support, add this hcall reflection to the P9 path. The P7/8 path does this in its realmode hcall handler (sc_1_fast_return). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-30-npiggin@gmail.com
|
#
6165d5dd |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: add virtual mode handlers for HPT hcalls and page faults In order to support hash guests in the P9 path (which does not do real mode hcalls or page fault handling), these real-mode hash specific interrupts need to be implemented in virt mode. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-29-npiggin@gmail.com
|
#
a9aa86e0 |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: small pseries_do_hcall cleanup Functionality should not be changed. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-28-npiggin@gmail.com
|
#
cbcff8b1 |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Allow all P9 processors to enable nested HV All radix guests go via the P9 path now, so there is no need to limit nested HV to processors that support "mixed mode" MMU. Remove the restriction. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-27-npiggin@gmail.com
|
#
aaae8c79 |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Remove support for dependent threads mode on P9 Dependent-threads mode is the normal KVM mode for pre-POWER9 SMT processors, where all threads in a core (or subcore) would run the same partition at the same time, or they would run the host. This design was mandated by MMU state that is shared between threads in a processor, so the synchronisation point is in hypervisor real-mode that has essentially no shared state, so it's safe for multiple threads to gather and switch to the correct mode. It is implemented by having the host unplug all secondary threads and always run in SMT1 mode, and host QEMU threads essentially represent virtual cores that wake these secondary threads out of unplug when the ioctl is called to run the guest. This happens via a side-path that is mostly invisible to the rest of the Linux host and the secondary threads still appear to be unplugged. POWER9 / ISA v3.0 has a more flexible MMU design that is independent per-thread and allows a much simpler KVM implementation. Before the new "P9 fast path" was added that began to take advantage of this, POWER9 support was implemented in the existing path which has support to run in the dependent threads mode. So it was not much work to add support to run POWER9 in this dependent threads mode. The mode is not required by the POWER9 MMU (although "mixed-mode" hash / radix MMU limitations of early processors were worked around using this mode). But it is one way to run SMT guests without running different guests or guest and host on different threads of the same core, so it could avoid or reduce some SMT attack surfaces without turning off SMT entirely. This security feature has some real, if indeterminate, value. However the old path is lagging in features (nested HV), and with this series the new P9 path adds remaining missing features (radix prefetch bug and hash support, in later patches), so POWER9 dependent threads mode support would be the only remaining reason to keep that code in and keep supporting POWER9/POWER10 in the old path. So here we make the call to drop this feature. Remove dependent threads mode support for POWER9 and above processors. Systems can still achieve this security by disabling SMT entirely, but that would generally come at a larger performance cost for guests. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-23-npiggin@gmail.com
|
#
2e1ae9cd |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Implement radix prefetch workaround by disabling MMU Rather than partition the guest PID space + flush a rogue guest PID to work around this problem, instead fix it by always disabling the MMU when switching in or out of guest MMU context in HV mode. This may be a bit less efficient, but it is a lot less complicated and allows the P9 path to trivally implement the workaround too. Newer CPUs are not subject to this issue. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-22-npiggin@gmail.com
|
#
edba6aff |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Add helpers for OS SPR handling This is a first step to wrapping supervisor and user SPR saving and loading up into helpers, which will then be called independently in bare metal and nested HV cases in order to optimise SPR access. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-20-npiggin@gmail.com
|
#
6d770e3f |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Read machine check registers while MSR[RI] is 0 SRR0/1, DAR, DSISR must all be protected from machine check which can clobber them. Ensure MSR[RI] is clear while they are live. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-17-npiggin@gmail.com
|
#
c00366e2 |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: inline kvmhv_load_hv_regs_and_go into __kvmhv_vcpu_entry_p9 Now the initial C implementation is done, inline more HV code to make rearranging things easier. And rename __kvmhv_vcpu_entry_p9 to drop the leading underscores as it's now C, and is now a more complete vcpu entry. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-16-npiggin@gmail.com
|
#
89d35b23 |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Implement the rest of the P9 path in C Almost all logic is moved to C, by introducing a new in_guest mode for the P9 path that branches very early in the KVM interrupt handler to P9 exit code. The main P9 entry and exit assembly is now only about 160 lines of low level stack setup and register save/restore, plus a bad-interrupt handler. There are two motivations for this, the first is just make the code more maintainable being in C. The second is to reduce the amount of code running in a special KVM mode, "realmode". In quotes because with radix it is no longer necessarily real-mode in the MMU, but it still has to be treated specially because it may be in real-mode, and has various important registers like PID, DEC, TB, etc set to guest. This is hostile to the rest of Linux and can't use arbitrary kernel functionality or be instrumented well. This initial patch is a reasonably faithful conversion of the asm code, but it does lack any loop to return quickly back into the guest without switching out of realmode in the case of unimportant or easily handled interrupts. As explained in previous changes, handling HV interrupts very quickly in this low level realmode is not so important for P9 performance, and are important to avoid for security, observability, debugability reasons. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-15-npiggin@gmail.com
|
#
9dc2babc |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Stop handling hcalls in real-mode in the P9 path In the interest of minimising the amount of code that is run in "real-mode", don't handle hcalls in real mode in the P9 path. This requires some new handlers for H_CEDE and xics-on-xive to be added before xive is pulled or cede logic is checked. This introduces a change in radix guest behaviour where radix guests that execute 'sc 1' in userspace now get a privilege fault whereas previously the 'sc 1' would be reflected as a syscall interrupt to the guest kernel. That reflection is only required for hash guests that run PR KVM. Background: In POWER8 and earlier processors, it is very expensive to exit from the HV real mode context of a guest hypervisor interrupt, and switch to host virtual mode. On those processors, guest->HV interrupts reach the hypervisor with the MMU off because the MMU is loaded with guest context (LPCR, SDR1, SLB), and the other threads in the sub-core need to be pulled out of the guest too. Then the primary must save off guest state, invalidate SLB and ERAT, and load up host state before the MMU can be enabled to run in host virtual mode (~= regular Linux mode). Hash guests also require a lot of hcalls to run due to the nature of the MMU architecture and paravirtualisation design. The XICS interrupt controller requires hcalls to run. So KVM traditionally tries hard to avoid the full exit, by handling hcalls and other interrupts in real mode as much as possible. By contrast, POWER9 has independent MMU context per-thread, and in radix mode the hypervisor is in host virtual memory mode when the HV interrupt is taken. Radix guests do not require significant hcalls to manage their translations, and xive guests don't need hcalls to handle interrupts. So it's much less important for performance to handle hcalls in real mode on POWER9. One caveat is that the TCE hcalls are performance critical, real-mode variants introduced for POWER8 in order to achieve 10GbE performance. Real mode TCE hcalls were found to be less important on POWER9, which was able to drive 40GBe networking without them (using the virt mode hcalls) but performance is still important. These hcalls will benefit from subsequent guest entry/exit optimisation including possibly a faster "partial exit" that does not entirely switch to host context to handle the hcall. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-14-npiggin@gmail.com
|
#
48013cbc |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Move radix MMU switching instructions together Switching the MMU from radix<->radix mode is tricky particularly as the MMU can remain enabled and requires a certain sequence of SPR updates. Move these together into their own functions. This also includes the radix TLB check / flush because it's tied in to MMU switching due to tlbiel getting LPID from LPIDR. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-13-npiggin@gmail.com
|
#
09512c29 |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Move xive vcpu context management into kvmhv_p9_guest_entry Move the xive management up so the low level register switching can be pushed further down in a later patch. XIVE MMIO CI operations can run in higher level code with machine checks, tracing, etc., available. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-12-npiggin@gmail.com
|
#
6ffe2c6e |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Reduce irq_work vs guest decrementer races irq_work's use of the DEC SPR is racy with guest<->host switch and guest entry which flips the DEC interrupt to guest, which could lose a host work interrupt. This patch closes one race, and attempts to comment another class of races. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-11-npiggin@gmail.com
|
#
413679e7 |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Move setting HDEC after switching to guest LPCR LPCR[HDICE]=0 suppresses hypervisor decrementer exceptions on some processors, so it must be enabled before HDEC is set. Rather than set it in the host LPCR then setting HDEC, move the HDEC update to after the guest MMU context (including LPCR) is loaded. There shouldn't be much concern with delaying HDEC by some 10s or 100s of nanoseconds by setting it a bit later. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-10-npiggin@gmail.com
|
#
023c3c96 |
|
28-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: implement kvmppc_xive_pull_vcpu in C This is more symmetric with kvmppc_xive_push_vcpu, and has the advantage that it runs with the MMU on. The extra test added to the asm will go away with a future change. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210528090752.3542186-9-npiggin@gmail.com
|
#
6ba53317 |
|
26-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Save host FSCR in the P7/8 path Similar to commit 25edcc50d76c ("KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path"), ensure the P7/8 path saves and restores the host FSCR. The logic explained in that patch actually applies there to the old path well: a context switch can be made before kvmppc_vcpu_run_hv restores the host FSCR and returns. Now both the p9 and the p7/8 paths now save and restore their FSCR, it no longer needs to be restored at the end of kvmppc_vcpu_run_hv Fixes: b005255e12a3 ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs") Cc: stable@vger.kernel.org # v3.14+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210526125851.3436735-1-npiggin@gmail.com
|
#
1438709e |
|
26-May-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Save host FSCR in the P7/8 path Similar to commit 25edcc50d76c ("KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path"), ensure the P7/8 path saves and restores the host FSCR. The logic explained in that patch actually applies there to the old path well: a context switch can be made before kvmppc_vcpu_run_hv restores the host FSCR and returns. Now both the p9 and the p7/8 paths now save and restore their FSCR, it no longer needs to be restored at the end of kvmppc_vcpu_run_hv Fixes: b005255e12a3 ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs") Cc: stable@vger.kernel.org # v3.14+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210526125851.3436735-1-npiggin@gmail.com
|
#
6bd5b743 |
|
18-May-2021 |
Wanpeng Li <wanpengli@tencent.com> |
KVM: PPC: exit halt polling on need_resched() This is inspired by commit 262de4102c7bb8 (kvm: exit halt polling on need_resched() as well). Due to PPC implements an arch specific halt polling logic, we have to the need_resched() check there as well. This patch adds a helper function that can be shared between book3s and generic halt-polling loops. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Venkatesh Srinivas <venkateshs@chromium.org> Cc: Ben Segall <bsegall@google.com> Cc: Venkatesh Srinivas <venkateshs@chromium.org> Cc: Jim Mattson <jmattson@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1621339235-11131-1-git-send-email-wanpengli@tencent.com> [Make the function inline. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
b1c5356e |
|
01-Apr-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: PPC: Convert to the gfn-based MMU notifier callbacks Move PPC to the gfn-base MMU notifier APIs, and update all 15 bajillion PPC-internal hooks to work with gfns instead of hvas. No meaningful functional change intended, though the exact order of operations is slightly different since the memslot lookups occur before calling into arch code. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210402005658.3024832-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
0fd85cb8 |
|
11-Apr-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Fix CONFIG_SPAPR_TCE_IOMMU=n default hcalls This config option causes the warning in init_default_hcalls to fire because the TCE handlers are in the default hcall list but not implemented. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Daniel Axtens <dja@axtens.net> Acked-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210412014845.1517916-9-npiggin@gmail.com
|
#
4b5f0a0d |
|
11-Apr-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Remove redundant mtspr PSPB This SPR is set to 0 twice when exiting the guest. Suggested-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Daniel Axtens <dja@axtens.net> Acked-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210412014845.1517916-7-npiggin@gmail.com
|
#
72c15287 |
|
11-Apr-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Prevent radix guests setting LPCR[TC] Prevent radix guests setting LPCR[TC]. This bit only applies to hash partitions. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210412014845.1517916-6-npiggin@gmail.com
|
#
bcc92a0d |
|
11-Apr-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Disallow LPCR[AIL] to be set to 1 or 2 These are already disallowed by H_SET_MODE from the guest, also disallow these by updating LPCR directly. AIL modes can affect the host interrupt behaviour while the guest LPCR value is set, so filter it here too. Suggested-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210412014845.1517916-5-npiggin@gmail.com
|
#
67145ef4 |
|
11-Apr-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Add a function to filter guest LPCR bits Guest LPCR depends on hardware type, and future changes will add restrictions based on errata and guest MMU mode. Move this logic to a common function and use it for the cases where the guest wants to update its LPCR (or the LPCR of a nested guest). This also adds a warning in other places that set or update LPCR if we try to set something that would have been disallowed by the filter, as a sanity check. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210412014845.1517916-4-npiggin@gmail.com
|
#
5088eb40 |
|
11-Apr-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV P9: Restore host CTRL SPR after guest exit The host CTRL (runlatch) value is not restored after guest exit. The host CTRL should always be 1 except in CPU idle code, so this can result in the host running with runlatch clear, and potentially switching to a different vCPU which then runs with runlatch clear as well. This has little effect on P9 machines, CTRL is only responsible for some PMU counter logic in the host and so other than corner cases of software relying on that, or explicitly reading the runlatch value (Linux does not appear to be affected but it's possible non-Linux guests could be), there should be no execution correctness problem, though it could be used as a covert channel between guests. There may be microcontrollers, firmware or monitoring tools that sample the runlatch value out-of-band, however since the register is writable by guests, these values would (should) not be relied upon for correct operation of the host, so suboptimal performance or incorrect reporting should be the worst problem. Fixes: 95a6432ce9038 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210412014845.1517916-2-npiggin@gmail.com
|
#
3a96570f |
|
30-Jan-2021 |
Nicholas Piggin <npiggin@gmail.com> |
powerpc: convert interrupt handlers to use wrappers Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210130130852.2952424-29-npiggin@gmail.com
|
#
11266528 |
|
30-Jan-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Context tracking exit guest context before enabling irqs Interrupts that occur in kernel mode expect that context tracking is set to kernel. Enabling local irqs before context tracking switches from guest to host means interrupts can come in and trigger warnings about wrong context, and possibly worse. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210130130852.2952424-3-npiggin@gmail.com
|
#
a722076e |
|
05-Feb-2021 |
Fabiano Rosas <farosas@linux.ibm.com> |
KVM: PPC: Don't always report hash MMU capability for P9 < DD2.2 These machines don't support running both MMU types at the same time, so remove the KVM_CAP_PPC_MMU_HASH_V3 capability when the host is using Radix MMU. [paulus@ozlabs.org - added defensive check on kvmppc_hv_ops->hash_v3_possible] Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
25edcc50 |
|
04-Feb-2021 |
Fabiano Rosas <farosas@linux.ibm.com> |
KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path The Facility Status and Control Register is a privileged SPR that defines the availability of some features in problem state. Since it can be written by the guest, we must restore it to the previous host value after guest exit. This restoration is currently done by taking the value from current->thread.fscr, which in the P9 path is not enough anymore because the guest could context switch the QEMU thread, causing the guest-current value to be saved into the thread struct. The above situation manifested when running a QEMU linked against a libc with System Call Vectored support, which causes scv instructions to be run by QEMU early during the guest boot (during SLOF), at which point the FSCR is 0 due to guest entry. After a few scv calls (1 to a couple hundred), the context switching happens and the QEMU thread runs with the guest value, resulting in a Facility Unavailable interrupt. This patch saves and restores the host value of FSCR in the inner guest entry loop in a way independent of current->thread.fscr. The old way of doing it is still kept in place because it works for the old entry path. Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
b1b1697a |
|
17-Jan-2021 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Remove support for running HPT guest on RPT host without mixed mode support This reverts much of commit c01015091a770 ("KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix hosts"), which was required to run HPT guests on RPT hosts on early POWER9 CPUs without support for "mixed mode", which meant the host could not run with MMU on while guests were running. This code has some corner case bugs, e.g., when the guest hits a machine check or HMI the primary locks up waiting for secondaries to switch LPCR to host, which they never do. This could all be fixed in software, but most CPUs in production have mixed mode support, and those that don't are believed to be all in installations that don't use this capability. So simplify things and remove support. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
d9a47eda |
|
16-Dec-2020 |
Ravi Bangoria <ravi.bangoria@linux.ibm.com> |
KVM: PPC: Book3S HV: Introduce new capability for 2nd DAWR Introduce KVM_CAP_PPC_DAWR1 which can be used by QEMU to query whether KVM supports 2nd DAWR or not. The capability is by default disabled even when the underlying CPU supports 2nd DAWR. QEMU needs to check and enable it manually to use the feature. Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
bd1de1a0 |
|
16-Dec-2020 |
Ravi Bangoria <ravi.bangoria@linux.ibm.com> |
KVM: PPC: Book3S HV: Add infrastructure to support 2nd DAWR KVM code assumes single DAWR everywhere. Add code to support 2nd DAWR. DAWR is a hypervisor resource and thus H_SET_MODE hcall is used to set/ unset it. Introduce new case H_SET_MODE_RESOURCE_SET_DAWR1 for 2nd DAWR. Also, KVM will support 2nd DAWR only if CPU_FTR_DAWR1 is set. Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
122954ed7 |
|
16-Dec-2020 |
Ravi Bangoria <ravi.bangoria@linux.ibm.com> |
KVM: PPC: Book3S HV: Rename current DAWR macros and variables Power10 is introducing a second DAWR (Data Address Watchpoint Register). Use real register names (with suffix 0) from ISA for current macros and variables used by kvm. One exception is KVM_REG_PPC_DAWR. Keep it as it is because it's uapi so changing it will break userspace. Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
87fb4978 |
|
08-Dec-2020 |
Leonardo Bras <leobras.c@gmail.com> |
KVM: PPC: Book3S HV: Fix mask size for emulated msgsndp According to ISAv3.1 and ISAv3.0b, the msgsndp is described to split RB in: msgtype <- (RB) 32:36 payload <- (RB) 37:63 t <- (RB) 57:63 The current way of getting 'msgtype', and 't' is missing their MSB: msgtype: ((arg >> 27) & 0xf) : Gets (RB) 33:36, missing bit 32 t: (arg &= 0x3f) : Gets (RB) 58:63, missing bit 57 Fixes this by applying the correct mask. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Acked-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201208215707.31149-1-leobras.c@gmail.com
|
#
1d15ffdf |
|
28-Nov-2020 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Ratelimit machine check messages coming from guests A number of machine check exceptions are triggerable by the guest. Ratelimit these to avoid a guest flooding the host console and logs. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Use dedicated ratelimit state, not printk_ratelimit()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201128070728.825934-5-npiggin@gmail.com
|
#
e8063940 |
|
06-Oct-2020 |
Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> |
powerpc/mm: Update tlbiel loop on POWER10 With POWER10, single tlbiel instruction invalidates all the congruence class of the TLB and hence we need to issue only one tlbiel with SET=0. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201007053305.232879-1-aneesh.kumar@linux.ibm.com
|
#
a4f1d94e |
|
03-Oct-2020 |
Joe Perches <joe@perches.com> |
KVM: PPC: Book3S HV: Make struct kernel_param_ops definition const This should be const, so make it so. Signed-off-by: Joe Perches <joe@perches.com> Message-Id: <d130e88dd4c82a12d979da747cc0365c72c3ba15.1601770305.git.joe@perches.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
cf59eb13 |
|
21-Sep-2020 |
Wang Wensheng <wangwensheng4@huawei.com> |
KVM: PPC: Book3S: Fix symbol undeclared warnings Build the kernel with `C=2`: arch/powerpc/kvm/book3s_hv_nested.c:572:25: warning: symbol 'kvmhv_alloc_nested' was not declared. Should it be static? arch/powerpc/kvm/book3s_64_mmu_radix.c:350:6: warning: symbol 'kvmppc_radix_set_pte_at' was not declared. Should it be static? arch/powerpc/kvm/book3s_hv.c:3568:5: warning: symbol 'kvmhv_p9_guest_entry' was not declared. Should it be static? arch/powerpc/kvm/book3s_hv_rm_xics.c:767:15: warning: symbol 'eoi_rc' was not declared. Should it be static? arch/powerpc/kvm/book3s_64_vio_hv.c:240:13: warning: symbol 'iommu_tce_kill_rm' was not declared. Should it be static? arch/powerpc/kvm/book3s_64_vio.c:492:6: warning: symbol 'kvmppc_tce_iommu_do_map' was not declared. Should it be static? arch/powerpc/kvm/book3s_pr.c:572:6: warning: symbol 'kvmppc_set_pvr_pr' was not declared. Should it be static? Those symbols are used only in the files that define them so make them static to fix the warnings. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
35dfb43c |
|
03-Sep-2020 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Set LPCR[HDICE] before writing HDEC POWER8 and POWER9 machines have a hardware deviation where generation of a hypervisor decrementer exception is suppressed if the HDICE bit in the LPCR register is 0 at the time when the HDEC register decrements from 0 to -1. When entering a guest, KVM first writes the HDEC register with the time until it wants the CPU to exit the guest, and then writes the LPCR with the guest value, which includes HDICE = 1. If HDEC decrements from 0 to -1 during the interval between those two events, it is possible that we can enter the guest with HDEC already negative but no HDEC exception pending, meaning that no HDEC interrupt will occur while the CPU is in the guest, or at least not until HDEC wraps around. Thus it is possible for the CPU to keep executing in the guest for a long time; up to about 4 seconds on POWER8, or about 4.46 years on POWER9 (except that the host kernel hard lockup detector will fire first). To fix this, we set the LPCR[HDICE] bit before writing HDEC on guest entry. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
05e6295d |
|
10-Sep-2020 |
Fabiano Rosas <farosas@linux.ibm.com> |
KVM: PPC: Book3S HV: Do not allocate HPT for a nested guest The current nested KVM code does not support HPT guests. This is informed/enforced in some ways: - Hosts < P9 will not be able to enable the nested HV feature; - The nested hypervisor MMU capabilities will not contain KVM_CAP_PPC_MMU_HASH_V3; - QEMU reflects the MMU capabilities in the 'ibm,arch-vec-5-platform-support' device-tree property; - The nested guest, at 'prom_parse_mmu_model' ignores the 'disable_radix' kernel command line option if HPT is not supported; - The KVM_PPC_CONFIGURE_V3_MMU ioctl will fail if trying to use HPT. There is, however, still a way to start a HPT guest by using max-compat-cpu=power8 at the QEMU machine options. This leads to the guest being set to use hash after QEMU calls the KVM_PPC_ALLOCATE_HTAB ioctl. With the guest set to hash, the nested hypervisor goes through the entry path that has no knowledge of nesting (kvmppc_run_vcpu) and crashes when it tries to execute an hypervisor-privileged (mtspr HDEC) instruction at __kvmppc_vcore_entry: root@L1:~ $ qemu-system-ppc64 -machine pseries,max-cpu-compat=power8 ... <snip> [ 538.543303] CPU: 83 PID: 25185 Comm: CPU 0/KVM Not tainted 5.9.0-rc4 #1 [ 538.543355] NIP: c00800000753f388 LR: c00800000753f368 CTR: c0000000001e5ec0 [ 538.543417] REGS: c0000013e91e33b0 TRAP: 0700 Not tainted (5.9.0-rc4) [ 538.543470] MSR: 8000000002843033 <SF,VEC,VSX,FP,ME,IR,DR,RI,LE> CR: 22422882 XER: 20040000 [ 538.543546] CFAR: c00800000753f4b0 IRQMASK: 3 GPR00: c0080000075397a0 c0000013e91e3640 c00800000755e600 0000000080000000 GPR04: 0000000000000000 c0000013eab19800 c000001394de0000 00000043a054db72 GPR08: 00000000003b1652 0000000000000000 0000000000000000 c0080000075502e0 GPR12: c0000000001e5ec0 c0000007ffa74200 c0000013eab19800 0000000000000008 GPR16: 0000000000000000 c00000139676c6c0 c000000001d23948 c0000013e91e38b8 GPR20: 0000000000000053 0000000000000000 0000000000000001 0000000000000000 GPR24: 0000000000000001 0000000000000001 0000000000000000 0000000000000001 GPR28: 0000000000000001 0000000000000053 c0000013eab19800 0000000000000001 [ 538.544067] NIP [c00800000753f388] __kvmppc_vcore_entry+0x90/0x104 [kvm_hv] [ 538.544121] LR [c00800000753f368] __kvmppc_vcore_entry+0x70/0x104 [kvm_hv] [ 538.544173] Call Trace: [ 538.544196] [c0000013e91e3640] [c0000013e91e3680] 0xc0000013e91e3680 (unreliable) [ 538.544260] [c0000013e91e3820] [c0080000075397a0] kvmppc_run_core+0xbc8/0x19d0 [kvm_hv] [ 538.544325] [c0000013e91e39e0] [c00800000753d99c] kvmppc_vcpu_run_hv+0x404/0xc00 [kvm_hv] [ 538.544394] [c0000013e91e3ad0] [c0080000072da4fc] kvmppc_vcpu_run+0x34/0x48 [kvm] [ 538.544472] [c0000013e91e3af0] [c0080000072d61b8] kvm_arch_vcpu_ioctl_run+0x310/0x420 [kvm] [ 538.544539] [c0000013e91e3b80] [c0080000072c7450] kvm_vcpu_ioctl+0x298/0x778 [kvm] [ 538.544605] [c0000013e91e3ce0] [c0000000004b8c2c] sys_ioctl+0x1dc/0xc90 [ 538.544662] [c0000013e91e3dc0] [c00000000002f9a4] system_call_exception+0xe4/0x1c0 [ 538.544726] [c0000013e91e3e20] [c00000000000d140] system_call_common+0xf0/0x27c [ 538.544787] Instruction dump: [ 538.544821] f86d1098 60000000 60000000 48000099 e8ad0fe8 e8c500a0 e9264140 75290002 [ 538.544886] 7d1602a6 7cec42a6 40820008 7d0807b4 <7d164ba6> 7d083a14 f90d10a0 480104fd [ 538.544953] ---[ end trace 74423e2b948c2e0c ]--- This patch makes the KVM_PPC_ALLOCATE_HTAB ioctl fail when running in the nested hypervisor, causing QEMU to abort. Reported-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
dc462267 |
|
25-Aug-2020 |
Nicholas Piggin <npiggin@gmail.com> |
powerpc/64s: handle ISA v3.1 local copy-paste context switches The ISA v3.1 the copy-paste facility has a new memory move functionality which allows the copy buffer to be pasted to domestic memory (RAM) as opposed to foreign memory (accelerator). This means the POWER9 trick of avoiding the cp_abort on context switch if the process had not mapped foreign memory does not work on POWER10. Do the cp_abort unconditionally there. KVM must also cp_abort on guest exit to prevent copy buffer state leaking between contexts. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200825075535.224536-1-npiggin@gmail.com
|
#
a2ce7200 |
|
27-Jul-2020 |
Laurent Dufour <ldufour@linux.ibm.com> |
KVM: PPC: Book3S HV: Migrate hot plugged memory When a memory slot is hot plugged to a SVM, PFNs associated with the GFNs in that slot must be migrated to the secure-PFNs, aka device-PFNs. Call kvmppc_uv_migrate_mem_slot() to accomplish this. Disable page-merge for all pages in the memory slot. Reviewed-by: Bharata B Rao <bharata@linux.ibm.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> [rearranged the code, and modified the commit log] Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
6f3fe297 |
|
23-Jul-2020 |
Ravi Bangoria <ravi.bangoria@linux.ibm.com> |
powerpc/watchpoint: Rename current H_SET_MODE DAWR macro Current H_SET_MODE hcall macro name for setting/resetting DAWR0 is H_SET_MODE_RESOURCE_SET_DAWR. Add suffix 0 to macro name as well. Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Reviewed-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200723090813.303838-8-ravi.bangoria@linux.ibm.com
|
#
5752fe0b |
|
17-Jul-2020 |
Athira Rajeev <atrajeev@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Save/restore new PMU registers Power ISA v3.1 has added new performance monitoring unit (PMU) special purpose registers (SPRs). They are: Monitor Mode Control Register 3 (MMCR3) Sampled Instruction Event Register A (SIER2) Sampled Instruction Event Register B (SIER3) Add support to save/restore these new SPRs while entering/exiting guest. Also include changes to support KVM_REG_PPC_MMCR3/SIER2/SIER3. Add new SPRs to KVM API documentation. Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1594996707-3727-6-git-send-email-atrajeev@linux.vnet.ibm.com
|
#
7e4a145e |
|
17-Jul-2020 |
Athira Rajeev <atrajeev@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Cleanup updates for kvm vcpu MMCR Currently `kvm_vcpu_arch` stores all Monitor Mode Control registers in a flat array in order: mmcr0, mmcr1, mmcra, mmcr2, mmcrs Split this to give mmcra and mmcrs its own entries in vcpu and use a flat array for mmcr0 to mmcr2. This patch implements this cleanup to make code easier to read. Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> [mpe: Fix MMCRA/MMCR2 uapi breakage as noted by paulus] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1594996707-3727-3-git-send-email-atrajeev@linux.vnet.ibm.com
|
#
4cb4ade1 |
|
01-Jun-2020 |
Alistair Popple <alistair@popple.id.au> |
KVM: PPC: Book3SHV: Enable support for ISA v3.1 guests Adds support for emulating ISAv3.1 guests by adding the appropriate PCR and FSCR bits. Signed-off-by: Alistair Popple <alistair@popple.id.au> Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
d6bdceb6 |
|
29-May-2020 |
Peter Zijlstra <peterz@infradead.org> |
powerpc64: Break asm/percpu.h vs spinlock_types.h dependency In order to use <asm/percpu.h> in lockdep.h, we need to make sure asm/percpu.h does not itself depend on lockdep. The below seems to make that so and builds powerpc64-defconfig + PROVE_LOCKING. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> https://lkml.kernel.org/r/20200623083721.336906073@infradead.org
|
#
d8ed45c5 |
|
08-Jun-2020 |
Michel Lespinasse <walken@google.com> |
mmap locking API: use coccinelle to convert mmap_sem rwsem call sites This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock | -down_write +mmap_write_lock | -down_write_killable +mmap_write_lock_killable | -down_write_trylock +mmap_write_trylock | -up_write +mmap_write_unlock | -downgrade_write +mmap_write_downgrade | -down_read +mmap_read_lock | -down_read_killable +mmap_read_lock_killable | -down_read_trylock +mmap_read_trylock | -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3fd5836e |
|
20-May-2020 |
Alistair Popple <alistair@popple.id.au> |
powerpc: Add support for ISA v3.1 Newer ISA versions are enabled by clearing all bits in the PCR associated with previous versions of the ISA. Enable ISA v3.1 support by updating the PCR mask to include ISA v3.0. This ensures all PCR bits corresponding to earlier architecture versions get cleared thereby enabling ISA v3.1 if supported by the hardware. Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200521014341.29095-3-alistair@popple.id.au
|
#
e3326ae3 |
|
20-May-2020 |
Laurent Dufour <ldufour@linux.ibm.com> |
KVM: PPC: Book3S HV: Relax check on H_SVM_INIT_ABORT The commit 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls") added checks of secure bit of SRR1 to filter out the Hcall reserved to the Ultravisor. However, the Hcall H_SVM_INIT_ABORT is made by the Ultravisor passing the context of the VM calling UV_ESM. This allows the Hypervisor to return to the guest without going through the Ultravisor. Thus the Secure bit of SRR1 is not set in that particular case. In the case a regular VM is calling H_SVM_INIT_ABORT, this hcall will be filtered out in kvmppc_h_svm_init_abort() because kvm->arch.secure_guest is not set in that case. Fixes: 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls") Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
8c99d345 |
|
26-Apr-2020 |
Tianjia Zhang <tianjia.zhang@linux.alibaba.com> |
KVM: PPC: Clean up redundant 'kvm_run' parameters In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu' structure. For historical reasons, many kvm-related function parameters retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This patch does a unified cleanup of these remaining redundant parameters. Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
2610a57f |
|
26-Apr-2020 |
Tianjia Zhang <tianjia.zhang@linux.alibaba.com> |
KVM: PPC: Remove redundant kvm_run from vcpu_arch The 'kvm_run' field already exists in the 'vcpu' structure, which is the same structure as the 'kvm_run' in the 'vcpu_arch' and should be deleted. Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
09f82b06 |
|
14-May-2020 |
Ravi Bangoria <ravi.bangoria@linux.ibm.com> |
powerpc/watchpoint: Rename current DAWR macros Power10 is introducing second DAWR. Use real register names from ISA for current macros: s/SPRN_DAWR/SPRN_DAWR0/ s/SPRN_DAWRX/SPRN_DAWRX0/ Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Michael Neuling <mikey@neuling.org> Link: https://lore.kernel.org/r/20200514111741.97993-2-ravi.bangoria@linux.ibm.com
|
#
da4ad88c |
|
23-Apr-2020 |
Davidlohr Bueso <dave@stgolabs.net> |
kvm: Replace vcpu->swait with rcuwait The use of any sort of waitqueue (simple or regular) for wait/waking vcpus has always been an overkill and semantically wrong. Because this is per-vcpu (which is blocked) there is only ever a single waiting vcpu, thus no need for any sort of queue. As such, make use of the rcuwait primitive, with the following considerations: - rcuwait already provides the proper barriers that serialize concurrent waiter and waker. - Task wakeup is done in rcu read critical region, with a stable task pointer. - Because there is no concurrency among waiters, we need not worry about rcuwait_wait_event() calls corrupting the wait->task. As a consequence, this saves the locking done in swait when modifying the queue. This also applies to per-vcore wait for powerpc kvm-hv. The x86 tscdeadline_latency test mentioned in 8577370fb0cb ("KVM: Use simple waitqueue for vcpu->wq") shows that, on avg, latency is reduced by around 15-20% with this change. Cc: Paul Mackerras <paulus@ozlabs.org> Cc: kvmarm@lists.cs.columbia.edu Cc: linux-mips@vger.kernel.org Reviewed-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Message-Id: <20200424054837.5138-6-dave@stgolabs.net> [Avoid extra logic changes. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
9a5788c6 |
|
18-Mar-2020 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Add a capability for enabling secure guests At present, on Power systems with Protected Execution Facility hardware and an ultravisor, a KVM guest can transition to being a secure guest at will. Userspace (QEMU) has no way of knowing whether a host system is capable of running secure guests. This will present a problem in future when the ultravisor is capable of migrating secure guests from one host to another, because virtualization management software will have no way to ensure that secure guests only run in domains where all of the hosts can support secure guests. This adds a VM capability which has two functions: (a) userspace can query it to find out whether the host can support secure guests, and (b) userspace can enable it for a guest, which allows that guest to become a secure guest. If userspace does not enable it, KVM will return an error when the ultravisor does the hypercall that indicates that the guest is starting to transition to a secure guest. The ultravisor will then abort the transition and the guest will terminate. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Ram Pai <linuxram@us.ibm.com>
|
#
8c47b6ff |
|
20-Mar-2020 |
Laurent Dufour <ldufour@linux.ibm.com> |
KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls The Hcall named H_SVM_* are reserved to the Ultravisor. However, nothing prevent a malicious VM or SVM to call them. This could lead to weird result and should be filtered out. Checking the Secure bit of the calling MSR ensure that the call is coming from either the Ultravisor or a SVM. But any system call made from a SVM are going through the Ultravisor, and the Ultravisor should filter out these malicious call. This way, only the Ultravisor is able to make such a Hcall. Cc: Bharata B Rao <bharata@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Ram Pai <linuxram@us.ibnm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
6fef0c6b |
|
18-Mar-2020 |
Greg Kurz <groug@kaod.org> |
KVM: PPC: Kill kvmppc_ops::mmu_destroy() and kvmppc_mmu_destroy() These are only used by HV KVM and BookE, and in both cases they are nops. Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
1f50cc17 |
|
10-Mar-2020 |
Michael Roth <mdroth@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Fix H_CEDE return code for nested guests The h_cede_tm kvm-unit-test currently fails when run inside an L1 guest via the guest/nested hypervisor. ./run-tests.sh -v ... TESTNAME=h_cede_tm TIMEOUT=90s ACCEL= ./powerpc/run powerpc/tm.elf -smp 2,threads=2 -machine cap-htm=on -append "h_cede_tm" FAIL h_cede_tm (2 tests, 1 unexpected failures) While the test relates to transactional memory instructions, the actual failure is due to the return code of the H_CEDE hypercall, which is reported as 224 instead of 0. This happens even when no TM instructions are issued. 224 is the value placed in r3 to execute a hypercall for H_CEDE, and r3 is where the caller expects the return code to be placed upon return. In the case of guest running under a nested hypervisor, issuing H_CEDE causes a return from H_ENTER_NESTED. In this case H_CEDE is specially-handled immediately rather than later in kvmppc_pseries_do_hcall() as with most other hcalls, but we forget to set the return code for the caller, hence why kvm-unit-test sees the 224 return code and reports an error. Guest kernels generally don't check the return value of H_CEDE, so that likely explains why this hasn't caused issues outside of kvm-unit-tests so far. Fix this by setting r3 to 0 after we finish processing the H_CEDE. RHBZ: 1778556 Fixes: 4bad77799fed ("KVM: PPC: Book3S HV: Handle hypercalls correctly when nested") Cc: linuxppc-dev@ozlabs.org Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
4d395762 |
|
28-Feb-2020 |
Peter Xu <peterx@redhat.com> |
KVM: Remove unnecessary asm/kvm_host.h includes Remove includes of asm/kvm_host.h from files that already include linux/kvm_host.h to make it more obvious that there is no ordering issue between the two headers. linux/kvm_host.h includes asm/kvm_host.h to pick up architecture specific settings, and this will never change, i.e. including asm/kvm_host.h after linux/kvm_host.h may seem problematic, but in practice is simply redundant. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
0577d1ab |
|
18-Feb-2020 |
Sean Christopherson <seanjc@google.com> |
KVM: Terminate memslot walks via used_slots Refactor memslot handling to treat the number of used slots as the de facto size of the memslot array, e.g. return NULL from id_to_memslot() when an invalid index is provided instead of relying on npages==0 to detect an invalid memslot. Rework the sorting and walking of memslots in advance of dynamically sizing memslots to aid bisection and debug, e.g. with luck, a bug in the refactoring will bisect here and/or hit a WARN instead of randomly corrupting memory. Alternatively, a global null/invalid memslot could be returned, i.e. so callers of id_to_memslot() don't have to explicitly check for a NULL memslot, but that approach runs the risk of introducing difficult-to- debug issues, e.g. if the global null slot is modified. Constifying the return from id_to_memslot() to combat such issues is possible, but would require a massive refactoring of arch specific code and would still be susceptible to casting shenanigans. Add function comments to update_memslots() and search_memslots() to explicitly (and loudly) state how memslots are sorted. Opportunistically stuff @hva with a non-canonical value when deleting a private memslot on x86 to detect bogus usage of the freed slot. No functional change intended. Tested-by: Christoffer Dall <christoffer.dall@arm.com> Tested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
e96c81ee |
|
18-Feb-2020 |
Sean Christopherson <seanjc@google.com> |
KVM: Simplify kvm_free_memslot() and all its descendents Now that all callers of kvm_free_memslot() pass NULL for @dont, remove the param from the top-level routine and all arch's implementations. No functional change intended. Tested-by: Christoffer Dall <christoffer.dall@arm.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
82307e67 |
|
18-Feb-2020 |
Sean Christopherson <seanjc@google.com> |
KVM: PPC: Move memslot memory allocation into prepare_memory_region() Allocate the rmap array during kvm_arch_prepare_memory_region() to pave the way for removing kvm_arch_create_memslot() altogether. Moving PPC's memory allocation only changes the order of kernel memory allocations between PPC and common KVM code. No functional change intended. Acked-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
c4fd527f |
|
09-Feb-2020 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
powerpc/kvm: no need to check return value of debugfs_create functions When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Because of this cleanup, we get to remove a few fields in struct kvm_arch that are now unused. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [mpe: Fix build error in kvm/timing.c, adapt kvmppc_remove_cpu_debugfs()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200209105901.1620958-2-gregkh@linuxfoundation.org
|
#
ff030fdf |
|
18-Dec-2019 |
Sean Christopherson <seanjc@google.com> |
KVM: PPC: Move kvm_vcpu_init() invocation to common code Move the kvm_cpu_{un}init() calls to common PPC code as an intermediate step towards removing kvm_cpu_{un}init() altogether. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
c50bfbdc |
|
18-Dec-2019 |
Sean Christopherson <seanjc@google.com> |
KVM: PPC: Allocate vcpu struct in common PPC code Move allocation of all flavors of PPC vCPUs to common PPC code. All variants either allocate 'struct kvm_vcpu' directly, or require that the embedded 'struct kvm_vcpu' member be located at offset 0, i.e. guarantee that the allocation can be directly interpreted as a 'struct kvm_vcpu' object. Remove the message from the build-time assertion regarding placement of the struct, as compatibility with the arch usercopy region is no longer the sole dependent on 'struct kvm_vcpu' being at offset zero. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
1a978d9d |
|
18-Dec-2019 |
Sean Christopherson <seanjc@google.com> |
KVM: PPC: Book3S HV: Uninit vCPU if vcore creation fails Call kvm_vcpu_uninit() if vcore creation fails to avoid leaking any resources allocated by kvm_vcpu_init(), i.e. the vcpu->run page. Fixes: 371fefd6f2dc4 ("KVM: PPC: Allow book3s_hv guests to use SMT processor modes") Cc: stable@vger.kernel.org Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Acked-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
3a43970d |
|
06-Jan-2020 |
Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Implement H_SVM_INIT_ABORT hcall Implement the H_SVM_INIT_ABORT hcall which the Ultravisor can use to abort an SVM after it has issued the H_SVM_INIT_START and before the H_SVM_INIT_DONE hcalls. This hcall could be used when Ultravisor encounters security violations or other errors when starting an SVM. Note that this hcall is different from UV_SVM_TERMINATE ucall which is used by HV to terminate/cleanup an VM that has becore secure. The H_SVM_INIT_ABORT basically undoes operations that were done since the H_SVM_INIT_START hcall - i.e page-out all the VM pages back to normal memory, and terminate the SVM. (If we do not bring the pages back to normal memory, the text/data of the VM would be stuck in secure memory and since the SVM did not go secure, its MSR_S bit will be clear and the VM wont be able to access its pages even to do a clean exit). Based on patches and discussion with Paul Mackerras, Ram Pai and Bharata Rao. Signed-off-by: Ram Pai <linuxram@linux.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
ce477a7a |
|
19-Dec-2019 |
Sukadev Bhattiprolu <sukadev@linux.ibm.com> |
KVM: PPC: Add skip_page_out parameter to uvmem functions Add 'skip_page_out' parameter to kvmppc_uvmem_drop_pages() so the callers can specify whetheter or not to skip paging out pages. This will be needed in a follow-on patch that implements H_SVM_INIT_ABORT hcall. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
8a9c8925 |
|
26-Nov-2019 |
Leonardo Bras <leobras.c@gmail.com> |
KVM: PPC: Book3S: Replace current->mm by kvm->mm Given that in kvm_create_vm() there is: kvm->mm = current->mm; And that on every kvm_*_ioctl we have: if (kvm->mm != current->mm) return -EIO; I see no reason to keep using current->mm instead of kvm->mm. By doing so, we would reduce the use of 'global' variables on code, relying more in the contents of kvm struct. Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
d89c69f4 |
|
17-Dec-2019 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Don't do ultravisor calls on systems without ultravisor Commit 22945688acd4 ("KVM: PPC: Book3S HV: Support reset of secure guest") added a call to uv_svm_terminate, which is an ultravisor call, without any check that the guest is a secure guest or even that the system has an ultravisor. On a system without an ultravisor, the ultracall will degenerate to a hypercall, but since we are not in KVM guest context, the hypercall will get treated as a system call, which could have random effects depending on what happens to be in r0, and could also corrupt the current task's kernel stack. Hence this adds a test for the guest being a secure guest before doing uv_svm_terminate(). Fixes: 22945688acd4 ("KVM: PPC: Book3S HV: Support reset of secure guest") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
22945688 |
|
24-Nov-2019 |
Bharata B Rao <bharata@linux.ibm.com> |
KVM: PPC: Book3S HV: Support reset of secure guest Add support for reset of secure guest via a new ioctl KVM_PPC_SVM_OFF. This ioctl will be issued by QEMU during reset and includes the the following steps: - Release all device pages of the secure guest. - Ask UV to terminate the guest via UV_SVM_TERMINATE ucall - Unpin the VPA pages so that they can be migrated back to secure side when guest becomes secure again. This is required because pinned pages can't be migrated. - Reinit the partition scoped page tables After these steps, guest is ready to issue UV_ESM call once again to switch to secure mode. Signed-off-by: Bharata B Rao <bharata@linux.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> [Implementation of uv_svm_terminate() and its call from guest shutdown path] Signed-off-by: Ram Pai <linuxram@us.ibm.com> [Unpinning of VPA pages] Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
c3262257 |
|
24-Nov-2019 |
Bharata B Rao <bharata@linux.ibm.com> |
KVM: PPC: Book3S HV: Handle memory plug/unplug to secure VM Register the new memslot with UV during plug and unregister the memslot during unplug. In addition, release all the device pages during unplug. Signed-off-by: Bharata B Rao <bharata@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
ca9f4942 |
|
24-Nov-2019 |
Bharata B Rao <bharata@linux.ibm.com> |
KVM: PPC: Book3S HV: Support for running secure guests A pseries guest can be run as secure guest on Ultravisor-enabled POWER platforms. On such platforms, this driver will be used to manage the movement of guest pages between the normal memory managed by hypervisor (HV) and secure memory managed by Ultravisor (UV). HV is informed about the guest's transition to secure mode via hcalls: H_SVM_INIT_START: Initiate securing a VM H_SVM_INIT_DONE: Conclude securing a VM As part of H_SVM_INIT_START, register all existing memslots with the UV. H_SVM_INIT_DONE call by UV informs HV that transition of the guest to secure mode is complete. These two states (transition to secure mode STARTED and transition to secure mode COMPLETED) are recorded in kvm->arch.secure_guest. Setting these states will cause the assembly code that enters the guest to call the UV_RETURN ucall instead of trying to enter the guest directly. Migration of pages betwen normal and secure memory of secure guest is implemented in H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls. H_SVM_PAGE_IN: Move the content of a normal page to secure page H_SVM_PAGE_OUT: Move the content of a secure page to normal page Private ZONE_DEVICE memory equal to the amount of secure memory available in the platform for running secure guests is created. Whenever a page belonging to the guest becomes secure, a page from this private device memory is used to represent and track that secure page on the HV side. The movement of pages between normal and secure memory is done via migrate_vma_pages() using UV_PAGE_IN and UV_PAGE_OUT ucalls. In order to prevent the device private pages (that correspond to pages of secure guest) from participating in KSM merging, H_SVM_PAGE_IN calls ksm_madvise() under read version of mmap_sem. However ksm_madvise() needs to be under write lock. Hence we call kvmppc_svm_page_in with mmap_sem held for writing, and it then downgrades to a read lock after calling ksm_madvise. [paulus@ozlabs.org - roll in patch "KVM: PPC: Book3S HV: Take write mmap_sem when calling ksm_madvise"] Signed-off-by: Bharata B Rao <bharata@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
55d70042 |
|
02-Oct-2019 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Reject mflags=2 (LPCR[AIL]=2) ADDR_TRANS_MODE mode AIL=2 mode has no known users, so is not well tested or supported. Disallow guests from selecting this mode because it may become deprecated in future versions of the architecture. This policy decision is not left to QEMU because KVM support is required for AIL=2 (when injecting interrupts). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
268f4ef9 |
|
02-Oct-2019 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Reuse kvmppc_inject_interrupt for async guest delivery This consolidates the HV interrupt delivery logic into one place. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
87a45e07 |
|
02-Oct-2019 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S: Replace reset_msr mmu op with inject_interrupt arch op reset_msr sets the MSR for interrupt injection, but it's cleaner and more flexible to provide a single op to set both MSR and PC for the interrupt. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
13c7bb3c |
|
16-Sep-2019 |
Jordan Niethe <jniethe5@gmail.com> |
powerpc/64s: Set reserved PCR bits Currently the reserved bits of the Processor Compatibility Register (PCR) are cleared as per the Programming Note in Section 1.3.3 of version 3.0B of the Power ISA. This causes all new architecture features to be made available when running on newer processors with new architecture features added to the PCR as bits must be set to disable a given feature. For example to disable new features added as part of Version 2.07 of the ISA the corresponding bit in the PCR needs to be set. As new processor features generally require explicit kernel support they should be disabled until such support is implemented. Therefore kernels should set all unknown/reserved bits in the PCR such that any new architecture features which the kernel does not currently know about get disabled. An update is planned to the ISA to clarify that the PCR is an exception to the Programming Note on reserved bits in Section 1.3.3. Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Tested-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190917004605.22471-2-alistair@popple.id.au
|
#
2275d7b5 |
|
02-Sep-2019 |
Nicholas Piggin <npiggin@gmail.com> |
powerpc/64s/radix: introduce options to disable use of the tlbie instruction Introduce two options to control the use of the tlbie instruction. A boot time option which completely disables the kernel using the instruction, this is currently incompatible with HASH MMU, KVM, and coherent accelerators. And a debugfs option can be switched at runtime and avoids using tlbie for invalidating CPU TLBs for normal process and kernel address mappings. Coherent accelerators are still managed with tlbie, as will KVM partition scope translations. Cross-CPU TLB flushing is implemented with IPIs and tlbiel. This is a basic implementation which does not attempt to make any optimisation beyond the tlbie implementation. This is useful for performance testing among other things. For example in certain situations on large systems, using IPIs may be faster than tlbie as they can be directed rather than broadcast. Later we may also take advantage of the IPIs to do more interesting things such as trim the mm cpumask more aggressively. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190902152931.17840-7-npiggin@gmail.com
|
#
ff42df49 |
|
26-Aug-2019 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Don't lose pending doorbell request on migration on P9 On POWER9, when userspace reads the value of the DPDES register on a vCPU, it is possible for 0 to be returned although there is a doorbell interrupt pending for the vCPU. This can lead to a doorbell interrupt being lost across migration. If the guest kernel uses doorbell interrupts for IPIs, then it could malfunction because of the lost interrupt. This happens because a newly-generated doorbell interrupt is signalled by setting vcpu->arch.doorbell_request to 1; the DPDES value in vcpu->arch.vcore->dpdes is not updated, because it can only be updated when holding the vcpu mutex, in order to avoid races. To fix this, we OR in vcpu->arch.doorbell_request when reading the DPDES value. Cc: stable@vger.kernel.org # v4.13+ Fixes: 579006944e0d ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9") Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
#
d28eafc5 |
|
26-Aug-2019 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Check for MMU ready on piggybacked virtual cores When we are running multiple vcores on the same physical core, they could be from different VMs and so it is possible that one of the VMs could have its arch.mmu_ready flag cleared (for example by a concurrent HPT resize) when we go to run it on a physical core. We currently check the arch.mmu_ready flag for the primary vcore but not the flags for the other vcores that will be run alongside it. This adds that check, and also a check when we select the secondary vcores from the preempted vcores list. Cc: stable@vger.kernel.org # v4.14+ Fixes: 38c53af85306 ("KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updates") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
c8b4083d |
|
02-Jul-2019 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Save and restore guest visible PSSCR bits on pseries The Performance Stop Status and Control Register (PSSCR) is used to control the power saving facilities of the processor. This register has various fields, some of which can be modified only in hypervisor state, and others which can be modified in both hypervisor and privileged non-hypervisor state. The bits which can be modified in privileged non-hypervisor state are referred to as guest visible. Currently the L0 hypervisor saves and restores both it's own host value as well as the guest value of the PSSCR when context switching between the hypervisor and guest. However a nested hypervisor running it's own nested guests (as indicated by kvmhv_on_pseries()) doesn't context switch the PSSCR register. That means if a nested (L2) guest modifies the PSSCR then the L1 guest hypervisor will run with that modified value, and if the L1 guest hypervisor modifies the PSSCR and then goes to run the nested (L2) guest again then the L2 PSSCR value will be lost. Fix this by having the (L1) nested hypervisor save and restore both its host and the guest PSSCR value when entering and exiting a nested (L2) guest. Note that only the guest visible parts of the PSSCR are context switched since this is all the L1 nested hypervisor can access, this is fine however as these are the only fields the L0 hypervisor provides guest control of anyway and so all other fields are ignored. This could also have been implemented by adding the PSSCR register to the hv_regs passed to the L0 hypervisor as input to the H_ENTER_NESTED hcall, however this would have meant updating the structure layout and thus required modifications to both the L0 and L1 kernels. Whereas the approach used doesn't require L0 kernel modifications while achieving the same result. Fixes: 95a6432ce903 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190703012022.15644-3-sjitindarsingh@gmail.com
|
#
63279eeb |
|
02-Jul-2019 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Always save guest pmu for guest capable of nesting The performance monitoring unit (PMU) registers are saved on guest exit when the guest has set the pmcregs_in_use flag in its lppaca, if it exists, or unconditionally if it doesn't. If a nested guest is being run then the hypervisor doesn't, and in most cases can't, know if the PMU registers are in use since it doesn't know the location of the lppaca for the nested guest, although it may have one for its immediate guest. This results in the values of these registers being lost across nested guest entry and exit in the case where the nested guest was making use of the performance monitoring facility while it's nested guest hypervisor wasn't. Further more the hypervisor could interrupt a guest hypervisor between when it has loaded up the PMU registers and it calling H_ENTER_NESTED or between returning from the nested guest to the guest hypervisor and the guest hypervisor reading the PMU registers, in kvmhv_p9_guest_entry(). This means that it isn't sufficient to just save the PMU registers when entering or exiting a nested guest, but that it is necessary to always save the PMU registers whenever a guest is capable of running nested guests to ensure the register values aren't lost in the context switch. Ensure the PMU register values are preserved by always saving their value into the vcpu struct when a guest is capable of running nested guests. This should have minimal performance impact however any impact can be avoided by booting a guest with "-machine pseries,cap-nested-hv=false" on the qemu commandline. Fixes: 95a6432ce903 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190703012022.15644-1-sjitindarsingh@gmail.com
|
#
3c25ab35 |
|
19-Jun-2019 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Clear pending decrementer exceptions on nested guest entry If we enter an L1 guest with a pending decrementer exception then this is cleared on guest exit if the guest has writtien a positive value into the decrementer (indicating that it handled the decrementer exception) since there is no other way to detect that the guest has handled the pending exception and that it should be dequeued. In the event that the L1 guest tries to run a nested (L2) guest immediately after this and the L2 guest decrementer is negative (which is loaded by L1 before making the H_ENTER_NESTED hcall), then the pending decrementer exception isn't cleared and the L2 entry is blocked since L1 has a pending exception, even though L1 may have already handled the exception and written a positive value for it's decrementer. This results in a loop of L1 trying to enter the L2 guest and L0 blocking the entry since L1 has an interrupt pending with the outcome being that L2 never gets to run and hangs. Fix this by clearing any pending decrementer exceptions when L1 makes the H_ENTER_NESTED hcall since it won't do this if it's decrementer has gone negative, and anyway it's decrementer has been communicated to L0 in the hdec_expires field and L0 will return control to L1 when this goes negative by delivering an H_DECREMENTER exception. Fixes: 95a6432ce903 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
86953770 |
|
19-Jun-2019 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Signed extend decrementer value if not using large decrementer On POWER9 the decrementer can operate in large decrementer mode where the decrementer is 56 bits and signed extended to 64 bits. When not operating in this mode the decrementer behaves as a 32 bit decrementer which is NOT signed extended (as on POWER8). Currently when reading a guest decrementer value we don't take into account whether the large decrementer is enabled or not, and this means the value will be incorrect when the guest is not using the large decrementer. Fix this by sign extending the value read when the guest isn't using the large decrementer. Fixes: 95a6432ce903 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
d2912cb1 |
|
04-Jun-2019 |
Thomas Gleixner <tglx@linutronix.de> |
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 Based on 2 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation # extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 4122 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
d724c9e5 |
|
29-May-2019 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Restore SPRG3 in kvmhv_p9_guest_entry() The sprgs are a set of 4 general purpose sprs provided for software use. SPRG3 is special in that it can also be read from userspace. Thus it is used on linux to store the cpu and numa id of the process to speed up syscall access to this information. This register is overwritten with the guest value on kvm guest entry, and so needs to be restored on exit again. Thus restore the value on the guest exit path in kvmhv_p9_guest_entry(). Cc: stable@vger.kernel.org # v4.20+ Fixes: 95a6432ce9038 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
1b28d553 |
|
27-May-2019 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Fix lockdep warning when entering guest on POWER9 Commit 3309bec85e60 ("KVM: PPC: Book3S HV: Fix lockdep warning when entering the guest") moved calls to trace_hardirqs_{on,off} in the entry path used for HPT guests. Similar code exists in the new streamlined entry path used for radix guests on POWER9. This makes the same change there, so as to avoid lockdep warnings such as this: [ 228.686461] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled) [ 228.686480] WARNING: CPU: 116 PID: 3803 at ../kernel/locking/lockdep.c:4219 check_flags.part.23+0x21c/0x270 [ 228.686544] Modules linked in: vhost_net vhost xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat +xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter +ebtables ip6table_filter ip6_tables iptable_filter fuse kvm_hv kvm at24 ipmi_powernv regmap_i2c ipmi_devintf +uio_pdrv_genirq ofpart ipmi_msghandler uio powernv_flash mtd ibmpowernv opal_prd ip_tables ext4 mbcache jbd2 btrfs +zstd_decompress zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor +raid6_pq raid1 raid0 ses sd_mod enclosure scsi_transport_sas ast i2c_opal i2c_algo_bit drm_kms_helper syscopyarea +sysfillrect sysimgblt fb_sys_fops ttm drm i40e e1000e cxl aacraid tg3 drm_panel_orientation_quirks i2c_core [ 228.686859] CPU: 116 PID: 3803 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc1-xive+ #42 [ 228.686911] NIP: c0000000001b394c LR: c0000000001b3948 CTR: c000000000bfad20 [ 228.686963] REGS: c000200cdb50f570 TRAP: 0700 Not tainted (5.2.0-rc1-xive+) [ 228.687001] MSR: 9000000002823033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE> CR: 48222222 XER: 20040000 [ 228.687060] CFAR: c000000000116db0 IRQMASK: 1 [ 228.687060] GPR00: c0000000001b3948 c000200cdb50f800 c0000000015e7600 000000000000002e [ 228.687060] GPR04: 0000000000000001 c0000000001c71a0 000000006e655f73 72727563284e4f5f [ 228.687060] GPR08: 0000200e60680000 0000000000000000 c000200cdb486180 0000000000000000 [ 228.687060] GPR12: 0000000000002000 c000200fff61a680 0000000000000000 00007fffb75c0000 [ 228.687060] GPR16: 0000000000000000 0000000000000000 c0000000017d6900 c000000001124900 [ 228.687060] GPR20: 0000000000000074 c008000006916f68 0000000000000074 0000000000000074 [ 228.687060] GPR24: ffffffffffffffff ffffffffffffffff 0000000000000003 c000200d4b600000 [ 228.687060] GPR28: c000000001627e58 c000000001489908 c000000001627e58 c000000002304de0 [ 228.687377] NIP [c0000000001b394c] check_flags.part.23+0x21c/0x270 [ 228.687415] LR [c0000000001b3948] check_flags.part.23+0x218/0x270 [ 228.687466] Call Trace: [ 228.687488] [c000200cdb50f800] [c0000000001b3948] check_flags.part.23+0x218/0x270 (unreliable) [ 228.687542] [c000200cdb50f870] [c0000000001b6548] lock_is_held_type+0x188/0x1c0 [ 228.687595] [c000200cdb50f8d0] [c0000000001d939c] rcu_read_lock_sched_held+0xdc/0x100 [ 228.687646] [c000200cdb50f900] [c0000000001dd704] rcu_note_context_switch+0x304/0x340 [ 228.687701] [c000200cdb50f940] [c0080000068fcc58] kvmhv_run_single_vcpu+0xdb0/0x1120 [kvm_hv] [ 228.687756] [c000200cdb50fa20] [c0080000068fd5b0] kvmppc_vcpu_run_hv+0x5e8/0xe40 [kvm_hv] [ 228.687816] [c000200cdb50faf0] [c0080000071797dc] kvmppc_vcpu_run+0x34/0x48 [kvm] [ 228.687863] [c000200cdb50fb10] [c0080000071755dc] kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm] [ 228.687916] [c000200cdb50fba0] [c008000007165ccc] kvm_vcpu_ioctl+0x424/0x838 [kvm] [ 228.687957] [c000200cdb50fd10] [c000000000433a24] do_vfs_ioctl+0xd4/0xcd0 [ 228.687995] [c000200cdb50fdb0] [c000000000434724] ksys_ioctl+0x104/0x120 [ 228.688033] [c000200cdb50fe00] [c000000000434768] sys_ioctl+0x28/0x80 [ 228.688072] [c000200cdb50fe20] [c00000000000b888] system_call+0x5c/0x70 [ 228.688109] Instruction dump: [ 228.688142] 4bf6342d 60000000 0fe00000 e8010080 7c0803a6 4bfffe60 3c82ff87 3c62ff87 [ 228.688196] 388472d0 3863d738 4bf63405 60000000 <0fe00000> 4bffff4c 3c82ff87 3c62ff87 [ 228.688251] irq event stamp: 205 [ 228.688287] hardirqs last enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv] [ 228.688344] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv] [ 228.688412] softirqs last enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4 [ 228.688464] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210 [ 228.688513] ---[ end trace eb16f6260022a812 ]--- [ 228.688548] possible reason: unannotated irqs-off. [ 228.688571] irq event stamp: 205 [ 228.688607] hardirqs last enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv] [ 228.688664] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv] [ 228.688719] softirqs last enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4 [ 228.688758] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210 Cc: stable@vger.kernel.org # v4.20+ Fixes: 95a6432ce903 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Tested-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
5a3f4936 |
|
23-May-2019 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Don't take kvm->lock around kvm_for_each_vcpu Currently the HV KVM code takes the kvm->lock around calls to kvm_for_each_vcpu() and kvm_get_vcpu_by_id() (which can call kvm_for_each_vcpu() internally). However, that leads to a lock order inversion problem, because these are called in contexts where the vcpu mutex is held, but the vcpu mutexes nest within kvm->lock according to Documentation/virtual/kvm/locking.txt. Hence there is a possibility of deadlock. To fix this, we simply don't take the kvm->lock mutex around these calls. This is safe because the implementations of kvm_for_each_vcpu() and kvm_get_vcpu_by_id() have been designed to be able to be called locklessly. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
0d4ee88d |
|
23-May-2019 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Use new mutex to synchronize MMU setup Currently the HV KVM code uses kvm->lock in conjunction with a flag, kvm->arch.mmu_ready, to synchronize MMU setup and hold off vcpu execution until the MMU-related data structures are ready. However, this means that kvm->lock is being taken inside vcpu->mutex, which is contrary to Documentation/virtual/kvm/locking.txt and results in lockdep warnings. To fix this, we add a new mutex, kvm->arch.mmu_setup_lock, which nests inside the vcpu mutexes, and is taken in the places where kvm->lock was taken that are related to MMU setup. Additionally we take the new mutex in the vcpu creation code at the point where we are creating a new vcore, in order to provide mutual exclusion with kvmppc_update_lpcr() and ensure that an update to kvm->arch.lpcr doesn't get missed, which could otherwise lead to a stale vcore->lpcr value. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
44b198ae |
|
29-Apr-2019 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Save/restore vrsave register in kvmhv_p9_guest_entry() On POWER9 and later processors where the host can schedule vcpus on a per thread basis, there is a streamlined entry path used when the guest is radix. This entry path saves/restores the fp and vr state in kvmhv_p9_guest_entry() by calling store_[fp/vr]_state() and load_[fp/vr]_state(). This is the same as the old entry path however the old entry path also saved/restored the VRSAVE register, which isn't done in the new entry path. This means that the vrsave register is now volatile across guest exit, which is an incorrect change in behaviour. Fix this by saving/restoring the vrsave register in kvmhv_p9_guest_entry(). This restores the old, correct, behaviour. Fixes: 95a6432ce9038 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
70ea13f6 |
|
29-Apr-2019 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Flush TLB on secondary radix threads When running on POWER9 with kvm_hv.indep_threads_mode = N and the host in SMT1 mode, KVM will run guest VCPUs on offline secondary threads. If those guests are in radix mode, we fail to load the LPID and flush the TLB if necessary, leading to the guest crashing with an unsupported MMU fault. This arises from commit 9a4506e11b97 ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on", 2018-05-17), which didn't consider the case where indep_threads_mode = N. For simplicity, this makes the real-mode guest entry path flush the TLB in the same place for both radix and hash guests, as we did before 9a4506e11b97, though the code is now C code rather than assembly code. We also have the radix TLB flush open-coded rather than calling radix__local_flush_tlb_lpid_guest(), because the TLB flush can be called in real mode, and in real mode we don't want to invoke the tracepoint code. Fixes: 9a4506e11b97 ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
6fabc9f2 |
|
25-Apr-2019 |
Palmer Dabbelt <palmer@sifive.com> |
KVM: PPC: Book3S HV: smb->smp comment fixup I made the same typo when trying to grep for uses of smp_wmb and figured I might as well fix it. Signed-off-by: Palmer Dabbelt <palmer@sifive.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
3309bec8 |
|
28-Mar-2019 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
KVM: PPC: Book3S HV: Fix lockdep warning when entering the guest The trace_hardirqs_on() sets current->hardirqs_enabled and from here the lockdep assumes interrupts are enabled although they are remain disabled until the context switches to the guest. Consequent srcu_read_lock() checks the flags in rcu_lock_acquire(), observes disabled interrupts and prints a warning (see below). This moves trace_hardirqs_on/off closer to __kvmppc_vcore_entry to prevent lockdep from being confused. DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled) WARNING: CPU: 16 PID: 8038 at kernel/locking/lockdep.c:4128 check_flags.part.25+0x224/0x280 [...] NIP [c000000000185b84] check_flags.part.25+0x224/0x280 LR [c000000000185b80] check_flags.part.25+0x220/0x280 Call Trace: [c000003fec253710] [c000000000185b80] check_flags.part.25+0x220/0x280 (unreliable) [c000003fec253780] [c000000000187ea4] lock_acquire+0x94/0x260 [c000003fec253840] [c00800001a1e9768] kvmppc_run_core+0xa60/0x1ab0 [kvm_hv] [c000003fec253a10] [c00800001a1ed944] kvmppc_vcpu_run_hv+0x73c/0xec0 [kvm_hv] [c000003fec253ae0] [c00800001a1095dc] kvmppc_vcpu_run+0x34/0x48 [kvm] [c000003fec253b00] [c00800001a1056bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm] [c000003fec253b90] [c00800001a0f3618] kvm_vcpu_ioctl+0x460/0x850 [kvm] [c000003fec253d00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930 [c000003fec253db0] [c00000000041ce04] ksys_ioctl+0xc4/0x110 [c000003fec253e00] [c00000000041ce78] sys_ioctl+0x28/0x80 [c000003fec253e20] [c00000000000b5a4] system_call+0x5c/0x70 Instruction dump: 419e0034 3d220004 39291730 81290000 2f890000 409e0020 3c82ffc6 3c62ffc5 3884be70 386329c0 4bf6ea71 60000000 <0fe00000> 3c62ffc6 3863be90 4801273d irq event stamp: 1025 hardirqs last enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv] hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv] softirqs last enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00 softirqs last disabled at (0): [<0000000000000000>] (null) ---[ end trace 31180adcc848993e ]--- possible reason: unannotated irqs-off. irq event stamp: 1025 hardirqs last enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv] hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv] softirqs last enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00 softirqs last disabled at (0): [<0000000000000000>] (null) Fixes: 8b24e69fc47e ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry", 2017-06-26) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
2d34d1c3 |
|
22-Mar-2019 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Implement virtual mode H_PAGE_INIT handler Implement a virtual mode handler for the H_CALL H_PAGE_INIT which can be used to zero or copy a guest page. The page is defined to be 4k and must be 4k aligned. The in-kernel handler halves the time to handle this H_CALL compared to handling it in userspace for a radix guest. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
c1fe190c |
|
01-Apr-2019 |
Michael Neuling <mikey@neuling.org> |
powerpc: Add force enable of DAWR on P9 option This adds a flag so that the DAWR can be enabled on P9 via: echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous The DAWR was previously force disabled on POWER9 in: 9654153158 powerpc: Disable DAWR in the base POWER9 CPU features Also see Documentation/powerpc/DAWR-POWER9.txt This is a dangerous setting, USE AT YOUR OWN RISK. Some users may not care about a bad user crashing their box (ie. single user/desktop systems) and really want the DAWR. This allows them to force enable DAWR. This flag can also be used to disable DAWR access. Once this is cleared, all DAWR access should be cleared immediately and your machine once again safe from crashing. Userspace may get confused by toggling this. If DAWR is force enabled/disabled between getting the number of breakpoints (via PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an inconsistent view of what's available. Similarly for guests. For the DAWR to be enabled in a KVM guest, the DAWR needs to be force enabled in the host AND the guest. For this reason, this won't work on POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the dawr_enable_dangerous file will fail if the hypervisor doesn't support writing the DAWR. To double check the DAWR is working, run this kernel selftest: tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c Any errors/failures/skips mean something is wrong. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
7cb9eb10 |
|
17-Mar-2019 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Perserve PSSCR FAKE_SUSPEND bit on guest exit There is a hardware bug in some POWER9 processors where a treclaim in fake suspend mode can cause an inconsistency in the XER[SO] bit across the threads of a core, the workaround being to force the core into SMT4 when doing the treclaim. The FAKE_SUSPEND bit (bit 10) in the PSSCR is used to control whether a thread is in fake suspend or real suspend. The important difference here being that thread reconfiguration is blocked in real suspend but not fake suspend mode. When we exit a guest which was in fake suspend mode, we force the core into SMT4 while we do the treclaim in kvmppc_save_tm_hv(). However on the new exit path introduced with the function kvmhv_run_single_vcpu() we restore the host PSSCR before calling kvmppc_save_tm_hv() which means that if we were in fake suspend mode we put the thread into real suspend mode when we clear the PSSCR[FAKE_SUSPEND] bit. This means that we block thread reconfiguration and the thread which is trying to get the core into SMT4 before it can do the treclaim spins forever since it itself is blocking thread reconfiguration. The result is that that core is essentially lost. This results in a trace such as: [ 93.512904] CPU: 7 PID: 13352 Comm: qemu-system-ppc Not tainted 5.0.0 #4 [ 93.512905] NIP: c000000000098a04 LR: c0000000000cc59c CTR: 0000000000000000 [ 93.512908] REGS: c000003fffd2bd70 TRAP: 0100 Not tainted (5.0.0) [ 93.512908] MSR: 9000000302883033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE,TM[SE]> CR: 22222444 XER: 00000000 [ 93.512914] CFAR: c000000000098a5c IRQMASK: 3 [ 93.512915] PACATMSCRATCH: 0000000000000001 [ 93.512916] GPR00: 0000000000000001 c000003f6cc1b830 c000000001033100 0000000000000004 [ 93.512928] GPR04: 0000000000000004 0000000000000002 0000000000000004 0000000000000007 [ 93.512930] GPR08: 0000000000000000 0000000000000004 0000000000000000 0000000000000004 [ 93.512932] GPR12: c000203fff7fc000 c000003fffff9500 0000000000000000 0000000000000000 [ 93.512935] GPR16: 2000000000300375 000000000000059f 0000000000000000 0000000000000000 [ 93.512951] GPR20: 0000000000000000 0000000000080053 004000000256f41f c000003f6aa88ef0 [ 93.512953] GPR24: c000003f6aa89100 0000000000000010 0000000000000000 0000000000000000 [ 93.512956] GPR28: c000003f9e9a0800 0000000000000000 0000000000000001 c000203fff7fc000 [ 93.512959] NIP [c000000000098a04] pnv_power9_force_smt4_catch+0x1b4/0x2c0 [ 93.512960] LR [c0000000000cc59c] kvmppc_save_tm_hv+0x40/0x88 [ 93.512960] Call Trace: [ 93.512961] [c000003f6cc1b830] [0000000000080053] 0x80053 (unreliable) [ 93.512965] [c000003f6cc1b8a0] [c00800001e9cb030] kvmhv_p9_guest_entry+0x508/0x6b0 [kvm_hv] [ 93.512967] [c000003f6cc1b940] [c00800001e9cba44] kvmhv_run_single_vcpu+0x2dc/0xb90 [kvm_hv] [ 93.512968] [c000003f6cc1ba10] [c00800001e9cc948] kvmppc_vcpu_run_hv+0x650/0xb90 [kvm_hv] [ 93.512969] [c000003f6cc1bae0] [c00800001e8f620c] kvmppc_vcpu_run+0x34/0x48 [kvm] [ 93.512971] [c000003f6cc1bb00] [c00800001e8f2d4c] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm] [ 93.512972] [c000003f6cc1bb90] [c00800001e8e3918] kvm_vcpu_ioctl+0x460/0x7d0 [kvm] [ 93.512974] [c000003f6cc1bd00] [c0000000003ae2c0] do_vfs_ioctl+0xe0/0x8e0 [ 93.512975] [c000003f6cc1bdb0] [c0000000003aeb24] ksys_ioctl+0x64/0xe0 [ 93.512978] [c000003f6cc1be00] [c0000000003aebc8] sys_ioctl+0x28/0x80 [ 93.512981] [c000003f6cc1be20] [c00000000000b3a4] system_call+0x5c/0x70 [ 93.512983] Instruction dump: [ 93.512986] 419dffbc e98c0000 2e8b0000 38000001 60000000 60000000 60000000 40950068 [ 93.512993] 392bffff 39400000 79290020 39290001 <7d2903a6> 60000000 60000000 7d235214 To fix this we preserve the PSSCR[FAKE_SUSPEND] bit until we call kvmppc_save_tm_hv() which will mean the core can get into SMT4 and perform the treclaim. Note kvmppc_save_tm_hv() clears the PSSCR[FAKE_SUSPEND] bit again so there is no need to explicitly do that. Fixes: 95a6432ce9038 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
e40542af |
|
20-Feb-2019 |
Jordan Niethe <jniethe5@gmail.com> |
KVM: PPC: Book3S HV: Fix build failure without IOMMU support Currently trying to build without IOMMU support will fail: (.text+0x1380): undefined reference to `kvmppc_h_get_tce' (.text+0x1384): undefined reference to `kvmppc_rm_h_put_tce' (.text+0x149c): undefined reference to `kvmppc_rm_h_stuff_tce' (.text+0x14a0): undefined reference to `kvmppc_rm_h_put_tce_indirect' This happens because turning off IOMMU support will prevent book3s_64_vio_hv.c from being built because it is only built when SPAPR_TCE_IOMMU is set, which depends on IOMMU support. Fix it using ifdefs for the undefined references. Fixes: 76d837a4c0f9 ("KVM: PPC: Book3S PR: Don't include SPAPR TCE code on non-pseries platforms") Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
c0577201 |
|
20-Feb-2019 |
Paul Mackerras <paulus@ozlabs.org> |
powerpc/64s: Better printing of machine check info for guest MCEs This adds an "in_guest" parameter to machine_check_print_event_info() so that we can avoid trying to translate guest NIP values into symbolic form using the host kernel's symbol table. Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
884dfb72 |
|
20-Feb-2019 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Simplify machine check handling This makes the handling of machine check interrupts that occur inside a guest simpler and more robust, with less done in assembler code and in real mode. Now, when a machine check occurs inside a guest, we always get the machine check event struct and put a copy in the vcpu struct for the vcpu where the machine check occurred. We no longer call machine_check_queue_event() from kvmppc_realmode_mc_power7(), because on POWER8, when a vcpu is running on an offline secondary thread and we call machine_check_queue_event(), that calls irq_work_queue(), which doesn't work because the CPU is offline, but instead triggers the WARN_ON(lazy_irq_pending()) in pnv_smp_cpu_kill_self() (which fires again and again because nothing clears the condition). All that machine_check_queue_event() actually does is to cause the event to be printed to the console. For a machine check occurring in the guest, we now print the event in kvmppc_handle_exit_hv() instead. The assembly code at label machine_check_realmode now just calls C code and then continues exiting the guest. We no longer either synthesize a machine check for the guest in assembly code or return to the guest without a machine check. The code in kvmppc_handle_exit_hv() is extended to handle the case where the guest is not FWNMI-capable. In that case we now always synthesize a machine check interrupt for the guest. Previously, if the host thinks it has recovered the machine check fully, it would return to the guest without any notification that the machine check had occurred. If the machine check was caused by some action of the guest (such as creating duplicate SLB entries), it is much better to tell the guest that it has caused a problem. Therefore we now always generate a machine check interrupt for guests that are not FWNMI-capable. Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
d976f680 |
|
20-Feb-2019 |
Michael Ellerman <mpe@ellerman.id.au> |
KVM: PPC: Book3S HV: Context switch AMR on Power9 kvmhv_p9_guest_entry() implements a fast-path guest entry for Power9 when guest and host are both running with the Radix MMU. Currently in that path we don't save the host AMR (Authority Mask Register) value, and we always restore 0 on return to the host. That is OK at the moment because the AMR is not used for storage keys with the Radix MMU. However we plan to start using the AMR on Radix to prevent the kernel from reading/writing to userspace outside of copy_to/from_user(). In order to make that work we need to save/restore the AMR value. We only restore the value if it is different from the guest value, which is already in the register when we exit to the host. This should mean we rarely need to actually restore the value when running a modern Linux as a guest, because it will be using the same value as us. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Russell Currey <ruscur@russell.cc>
|
#
dee339b5 |
|
26-Jan-2019 |
Nir Weiner <nir.weiner@oracle.com> |
KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start grow_halt_poll_ns() have a strange behaviour in case (vcpu->halt_poll_ns != 0) && (vcpu->halt_poll_ns < halt_poll_ns_grow_start). In this case, vcpu->halt_poll_ns will be multiplied by grow factor (halt_poll_ns_grow) which will require several grow iteration in order to reach a value bigger than halt_poll_ns_grow_start. This means that growing vcpu->halt_poll_ns from value of 0 is slower than growing it from a positive value less than halt_poll_ns_grow_start. Which is misleading and inaccurate. Fix issue by changing grow_halt_poll_ns() to set vcpu->halt_poll_ns to halt_poll_ns_grow_start in any case that (vcpu->halt_poll_ns < halt_poll_ns_grow_start). Regardless if vcpu->halt_poll_ns is 0. use READ_ONCE to get a consistent number for all cases. Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Nir Weiner <nir.weiner@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
49113d36 |
|
26-Jan-2019 |
Nir Weiner <nir.weiner@oracle.com> |
KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter The hard-coded value 10000 in grow_halt_poll_ns() stands for the initial start value when raising up vcpu->halt_poll_ns. It actually sets the first timeout to the first polling session. This value has significant effect on how tolerant we are to outliers. On the standard case, higher value is better - we will spend more time in the polling busyloop, handle events/interrupts faster and result in better performance. But on outliers it puts us in a busy loop that does nothing. Even if the shrink factor is zero, we will still waste time on the first iteration. The optimal value changes between different workloads. It depends on outliers rate and polling sessions length. As this value has significant effect on the dynamic halt-polling algorithm, it should be configurable and exposed. Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Nir Weiner <nir.weiner@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
7fa08e71 |
|
26-Jan-2019 |
Nir Weiner <nir.weiner@oracle.com> |
KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns grow_halt_poll_ns() have a strange behavior in case (halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0). In this case, vcpu->halt_pol_ns will be set to zero. That results in shrinking instead of growing. Fix issue by changing grow_halt_poll_ns() to not modify vcpu->halt_poll_ns in case halt_poll_ns_grow is zero Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Nir Weiner <nir.weiner@oracle.com> Suggested-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
03f95332 |
|
04-Feb-2019 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVE Currently, the KVM code assumes that if the host kernel is using the XIVE interrupt controller (the new interrupt controller that first appeared in POWER9 systems), then the in-kernel XICS emulation will use the XIVE hardware to deliver interrupts to the guest. However, this only works when the host is running in hypervisor mode and has full access to all of the XIVE functionality. It doesn't work in any nested virtualization scenario, either with PR KVM or nested-HV KVM, because the XICS-on-XIVE code calls directly into the native-XIVE routines, which are not initialized and cannot function correctly because they use OPAL calls, and OPAL is not available in a guest. This means that using the in-kernel XICS emulation in a nested hypervisor that is using XIVE as its interrupt controller will cause a (nested) host kernel crash. To fix this, we change most of the places where the current code calls xive_enabled() to select between the XICS-on-XIVE emulation and the plain XICS emulation to call a new function, xics_on_xive(), which returns false in a guest. However, there is a further twist. The plain XICS emulation has some functions which are used in real mode and access the underlying XICS controller (the interrupt controller of the host) directly. In the case of a nested hypervisor, this means doing XICS hypercalls directly. When the nested host is using XIVE as its interrupt controller, these hypercalls will fail. Therefore this also adds checks in the places where the XICS emulation wants to access the underlying interrupt controller directly, and if that is XIVE, makes the code use the virtual mode fallback paths, which call generic kernel infrastructure rather than doing direct XICS access. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
08434ab4 |
|
07-Jan-2019 |
wangbo <wdjjwb@163.com> |
KVM: PPC: Book3S HV: Replace kmalloc_node+memset with kzalloc_node Replace kmalloc_node and memset with kzalloc_node Signed-off-by: wangbo <wang.bo116@zte.com.cn> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
6ff887b8 |
|
13-Dec-2018 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S: Introduce new hcall H_COPY_TOFROM_GUEST to access quadrants 1 & 2 A guest cannot access quadrants 1 or 2 as this would result in an exception. Thus introduce the hcall H_COPY_TOFROM_GUEST to be used by a guest when it wants to perform an access to quadrants 1 or 2, for example when it wants to access memory for one of its nested guests. Also provide an implementation for the kvm-hv module. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
873db2cd |
|
13-Dec-2018 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L2 guest Allow for a device which is being emulated at L0 (the host) for an L1 guest to be passed through to a nested (L2) guest. The existing kvmppc_hv_emulate_mmio function can be used here. The main challenge is that for a load the result must be stored into the L2 gpr, not an L1 gpr as would normally be the case after going out to qemu to complete the operation. This presents a challenge as at this point the L2 gpr state has been written back into L1 memory. To work around this we store the address in L1 memory of the L2 gpr where the result of the load is to be stored and use the new io_gpr value KVM_MMIO_REG_NESTED_GPR to indicate that this is a nested load for which completion must be done when returning back into the kernel. Then in kvmppc_complete_mmio_load() the resultant value is written into L1 memory at the location of the indicated L2 gpr. Note that we don't currently let an L1 guest emulate a device for an L2 guest which is then passed through to an L3 guest. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
dceadcf9 |
|
13-Dec-2018 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops struct The kvmppc_ops struct is used to store function pointers to kvm implementation specific functions. Introduce two new functions load_from_eaddr and store_to_eaddr to be used to load from and store to a guest effective address respectively. Also implement these for the kvm-hv module. If we are using the radix mmu then we can call the functions to access quadrant 1 and 2. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
5af3e9d0 |
|
11-Dec-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Flush guest mappings when turning dirty tracking on/off This adds code to flush the partition-scoped page tables for a radix guest when dirty tracking is turned on or off for a memslot. Only the guest real addresses covered by the memslot are flushed. The reason for this is to get rid of any 2M PTEs in the partition-scoped page tables that correspond to host transparent huge pages, so that page dirtiness is tracked at a system page (4k or 64k) granularity rather than a 2M granularity. The page tables are also flushed when turning dirty tracking off so that the memslot's address space can be repopulated with THPs if possible. To do this, we add a new function kvmppc_radix_flush_memslot(). Since this does what's needed for kvmppc_core_flush_memslot_hv() on a radix guest, we now make kvmppc_core_flush_memslot_hv() call the new kvmppc_radix_flush_memslot() rather than calling kvm_unmap_radix() for each page in the memslot. This has the effect of fixing a bug in that kvmppc_core_flush_memslot_hv() was previously calling kvm_unmap_radix() without holding the kvm->mmu_lock spinlock, which is required to be held. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
f032b734 |
|
11-Dec-2018 |
Bharata B Rao <bharata@linux.ibm.com> |
KVM: PPC: Pass change type down to memslot commit function Currently, kvm_arch_commit_memory_region() gets called with a parameter indicating what type of change is being made to the memslot, but it doesn't pass it down to the platform-specific memslot commit functions. This adds the `change' parameter to the lower-level functions so that they can use it in future. [paulus@ozlabs.org - fix book E also.] Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
234ff0b7 |
|
16-Nov-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Fix race between kvm_unmap_hva_range and MMU mode switch Testing has revealed an occasional crash which appears to be caused by a race between kvmppc_switch_mmu_to_hpt and kvm_unmap_hva_range_hv. The symptom is a NULL pointer dereference in __find_linux_pte() called from kvm_unmap_radix() with kvm->arch.pgtable == NULL. Looking at kvmppc_switch_mmu_to_hpt(), it does indeed clear kvm->arch.pgtable (via kvmppc_free_radix()) before setting kvm->arch.radix to NULL, and there is nothing to prevent kvm_unmap_hva_range_hv() or the other MMU callback functions from being called concurrently with kvmppc_switch_mmu_to_hpt() or kvmppc_switch_mmu_to_radix(). This patch therefore adds calls to spin_lock/unlock on the kvm->mmu_lock around the assignments to kvm->arch.radix, and makes sure that the partition-scoped radix tree or HPT is only freed after changing kvm->arch.radix. This also takes the kvm->mmu_lock in kvmppc_rmap_reset() to make sure that the clearing of each rmap array (one per memslot) doesn't happen concurrently with use of the array in the kvm_unmap_hva_range_hv() or the other MMU callbacks. Fixes: 18c3640cefc7 ("KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host") Cc: stable@vger.kernel.org # v4.15+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
6c08ec12 |
|
08-Nov-2018 |
Michael Roth <mdroth@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Fix handling for interrupted H_ENTER_NESTED While running a nested guest VCPU on L0 via H_ENTER_NESTED hcall, a pending signal in the L0 QEMU process can generate the following sequence: ret0 = kvmppc_pseries_do_hcall() ret1 = kvmhv_enter_nested_guest() ret2 = kvmhv_run_single_vcpu() if (ret2 == -EINTR) return H_INTERRUPT if (ret1 == H_INTERRUPT) kvmppc_set_gpr(vcpu, 3, 0) return -EINTR /* skipped: */ kvmppc_set_gpr(vcpu, 3, ret) vcpu->arch.hcall_needed = 0 return RESUME_GUEST which causes an exit to L0 userspace with ret0 == -EINTR. The intention seems to be to set the hcall return value to 0 (via VCPU r3) so that L1 will see a successful return from H_ENTER_NESTED once we resume executing the VCPU. However, because we don't set vcpu->arch.hcall_needed = 0, we do the following once userspace resumes execution via kvm_arch_vcpu_ioctl_run(): ... } else if (vcpu->arch.hcall_needed) { int i kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret); for (i = 0; i < 9; ++i) kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]); vcpu->arch.hcall_needed = 0; since vcpu->arch.hcall_needed == 1 indicates that userspace should have handled the hcall and stored the return value in run->papr_hcall.ret. Since that's not the case here, we can get an unexpected value in VCPU r3, which can result in kvmhv_p9_guest_entry() reporting an unexpected trap value when it returns from H_ENTER_NESTED, causing the following register dump to console via subsequent call to kvmppc_handle_exit_hv() in L1: [ 350.612854] vcpu 00000000f9564cf8 (0): [ 350.612915] pc = c00000000013eb98 msr = 8000000000009033 trap = 1 [ 350.613020] r 0 = c0000000004b9044 r16 = 0000000000000000 [ 350.613075] r 1 = c00000007cffba30 r17 = 0000000000000000 [ 350.613120] r 2 = c00000000178c100 r18 = 00007fffc24f3b50 [ 350.613166] r 3 = c00000007ef52480 r19 = 00007fffc24fff58 [ 350.613212] r 4 = 0000000000000000 r20 = 00000a1e96ece9d0 [ 350.613253] r 5 = 70616d00746f6f72 r21 = 00000a1ea117c9b0 [ 350.613295] r 6 = 0000000000000020 r22 = 00000a1ea1184360 [ 350.613338] r 7 = c0000000783be440 r23 = 0000000000000003 [ 350.613380] r 8 = fffffffffffffffc r24 = 00000a1e96e9e124 [ 350.613423] r 9 = c00000007ef52490 r25 = 00000000000007ff [ 350.613469] r10 = 0000000000000004 r26 = c00000007eb2f7a0 [ 350.613513] r11 = b0616d0009eccdb2 r27 = c00000007cffbb10 [ 350.613556] r12 = c0000000004b9000 r28 = c00000007d83a2c0 [ 350.613597] r13 = c000000001b00000 r29 = c0000000783cdf68 [ 350.613639] r14 = 0000000000000000 r30 = 0000000000000000 [ 350.613681] r15 = 0000000000000000 r31 = c00000007cffbbf0 [ 350.613723] ctr = c0000000004b9000 lr = c0000000004b9044 [ 350.613765] srr0 = 0000772f954dd48c srr1 = 800000000280f033 [ 350.613808] sprg0 = 0000000000000000 sprg1 = c000000001b00000 [ 350.613859] sprg2 = 0000772f9565a280 sprg3 = 0000000000000000 [ 350.613911] cr = 88002848 xer = 0000000020040000 dsisr = 42000000 [ 350.613962] dar = 0000772f95390000 [ 350.614031] fault dar = c000000244b278c0 dsisr = 00000000 [ 350.614073] SLB (0 entries): [ 350.614157] lpcr = 0040000003d40413 sdr1 = 0000000000000000 last_inst = ffffffff [ 350.614252] trap=0x1 | pc=0xc00000000013eb98 | msr=0x8000000000009033 followed by L1's QEMU reporting the following before stopping execution of the nested guest: KVM: unknown exit, hardware reason 1 NIP c00000000013eb98 LR c0000000004b9044 CTR c0000000004b9000 XER 0000000020040000 CPU#0 MSR 8000000000009033 HID0 0000000000000000 HF 8000000000000000 iidx 3 didx 3 TB 00000000 00000000 DECR 00000000 GPR00 c0000000004b9044 c00000007cffba30 c00000000178c100 c00000007ef52480 GPR04 0000000000000000 70616d00746f6f72 0000000000000020 c0000000783be440 GPR08 fffffffffffffffc c00000007ef52490 0000000000000004 b0616d0009eccdb2 GPR12 c0000000004b9000 c000000001b00000 0000000000000000 0000000000000000 GPR16 0000000000000000 0000000000000000 00007fffc24f3b50 00007fffc24fff58 GPR20 00000a1e96ece9d0 00000a1ea117c9b0 00000a1ea1184360 0000000000000003 GPR24 00000a1e96e9e124 00000000000007ff c00000007eb2f7a0 c00000007cffbb10 GPR28 c00000007d83a2c0 c0000000783cdf68 0000000000000000 c00000007cffbbf0 CR 88002848 [ L L - - E L G L ] RES ffffffffffffffff SRR0 0000772f954dd48c SRR1 800000000280f033 PVR 00000000004e1202 VRSAVE 0000000000000000 SPRG0 0000000000000000 SPRG1 c000000001b00000 SPRG2 0000772f9565a280 SPRG3 0000000000000000 SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6 0000000000000000 SPRG7 0000000000000000 HSRR0 0000000000000000 HSRR1 0000000000000000 CFAR 0000000000000000 LPCR 0000000003d40413 PTCR 0000000000000000 DAR 0000772f95390000 DSISR 0000000042000000 Fix this by setting vcpu->arch.hcall_needed = 0 to indicate completion of H_ENTER_NESTED before we exit to L0 userspace. Fixes: 360cae313702 ("KVM: PPC: Book3S HV: Nested guest entry via hypercall") Cc: linuxppc-dev@ozlabs.org Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com> Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
c43befca |
|
20-Oct-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Use exported tb_to_ns() function in decrementer emulation This changes the KVM code that emulates the decrementer function to do the conversion of decrementer values to time intervals in nanoseconds by calling the tb_to_ns() function exported by the powerpc timer code, in preference to open-coded arithmetic using values from the decrementer_clockevent struct. Similarly, the HV-KVM code that did the same conversion using arithmetic on tb_ticks_per_sec also now uses tb_to_ns(). Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
8d9fcacf |
|
19-Oct-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips This disables the use of the streamlined entry path for radix guests on early POWER9 chips that need the workaround added in commit a25bd72badfa ("powerpc/mm/radix: Workaround prefetch issue with KVM", 2017-07-24), because the streamlined entry path does not include that workaround. This also means that we can't do nested HV-KVM on those chips. Since the chips that need that workaround are the same ones that can't run both radix and HPT guests at the same time on different threads of a core, we use the existing 'no_mixing_hpt_and_radix' variable that identifies those chips to identify when we can't use the new guest entry path, and when we can't do nested virtualization. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
901f8c3f |
|
07-Oct-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Add NO_HASH flag to GET_SMMU_INFO ioctl result This adds a KVM_PPC_NO_HASH flag to the flags field of the kvm_ppc_smmu_info struct, and arranges for it to be set when running as a nested hypervisor, as an unambiguous indication to userspace that HPT guests are not supported. Reporting the KVM_CAP_PPC_MMU_HASH_V3 capability as false could be taken as indicating only that the new HPT features in ISA V3.0 are not supported, leaving it ambiguous whether pre-V3.0 HPT features are supported. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
aa069a99 |
|
21-Sep-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Add a VM capability to enable nested virtualization With this, userspace can enable a KVM-HV guest to run nested guests under it. The administrator can control whether any nested guests can be run; setting the "nested" module parameter to false prevents any guests becoming nested hypervisors (that is, any attempt to enable the nested capability on a guest will fail). Guests which are already nested hypervisors will continue to be so. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
de760db4 |
|
07-Oct-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Allow HV module to load without hypervisor mode With this, the KVM-HV module can be loaded in a guest running under KVM-HV, and if the hypervisor supports nested virtualization, this guest can now act as a nested hypervisor and run nested guests. This also adds some checks to inform userspace that HPT guests are not supported by nested hypervisors (by returning false for the KVM_CAP_PPC_MMU_HASH_V3 capability), and to prevent userspace from configuring a guest to use HPT mode. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
30323418 |
|
07-Oct-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Add one-reg interface to virtual PTCR register This adds a one-reg register identifier which can be used to read and set the virtual PTCR for the guest. This register identifies the address and size of the virtual partition table for the guest, which contains information about the nested guests under this guest. Migrating this value is the only extra requirement for migrating a guest which has nested guests (assuming of course that the destination host supports nested virtualization in the kvm-hv module). Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
f3c99f97 |
|
07-Oct-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Don't access HFSCR, LPIDR or LPCR when running nested When running as a nested hypervisor, this avoids reading hypervisor privileged registers (specifically HFSCR, LPIDR and LPCR) at startup; instead reasonable default values are used. This also avoids writing LPIDR in the single-vcpu entry/exit path. Also, this removes the check for CPU_FTR_HVMODE in kvmppc_mmu_hv_init() since its only caller already checks this. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
9d0b048d |
|
07-Oct-2018 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Invalidate TLB when nested vcpu moves physical cpu This is only done at level 0, since only level 0 knows which physical CPU a vcpu is running on. This does for nested guests what L0 already did for its own guests, which is to flush the TLB on a pCPU when it goes to run a vCPU there, and there is another vCPU in the same VM which previously ran on this pCPU and has now started to run on another pCPU. This is to handle the situation where the other vCPU touched a mapping, moved to another pCPU and did a tlbiel (local-only tlbie) on that new pCPU and thus left behind a stale TLB entry on this pCPU. This introduces a limit on the the vcpu_token values used in the H_ENTER_NESTED hcall -- they must now be less than NR_CPUS. [paulus@ozlabs.org - made prev_cpu array be short[] to reduce memory consumption.] Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
e3b6b466 |
|
07-Oct-2018 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Implement H_TLB_INVALIDATE hcall When running a nested (L2) guest the guest (L1) hypervisor will use the H_TLB_INVALIDATE hcall when it needs to change the partition scoped page tables or the partition table which it manages. It will use this hcall in the situations where it would use a partition-scoped tlbie instruction if it were running in hypervisor mode. The H_TLB_INVALIDATE hcall can invalidate different scopes: Invalidate TLB for a given target address: - This invalidates a single L2 -> L1 pte - We need to invalidate any L2 -> L0 shadow_pgtable ptes which map the L2 address space which is being invalidated. This is because a single L2 -> L1 pte may have been mapped with more than one pte in the L2 -> L0 page tables. Invalidate the entire TLB for a given LPID or for all LPIDs: - Invalidate the entire shadow_pgtable for a given nested guest, or for all nested guests. Invalidate the PWC (page walk cache) for a given LPID or for all LPIDs: - We don't cache the PWC, so nothing to do. Invalidate the entire TLB, PWC and partition table for a given/all LPIDs: - Here we re-read the partition table entry and remove the nested state for any nested guest for which the first doubleword of the partition table entry is now zero. The H_TLB_INVALIDATE hcall takes as parameters the tlbie instruction word (of which only the RIC, PRS and R fields are used), the rS value (giving the lpid, where required) and the rB value (giving the IS, AP and EPN values). [paulus@ozlabs.org - adapted to having the partition table in guest memory, added the H_TLB_INVALIDATE implementation, removed tlbie instruction emulation, reworded the commit message.] Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
8cf531ed |
|
07-Oct-2018 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Introduce rmap to track nested guest mappings When a host (L0) page which is mapped into a (L1) guest is in turn mapped through to a nested (L2) guest we keep a reverse mapping (rmap) so that these mappings can be retrieved later. Whenever we create an entry in a shadow_pgtable for a nested guest we create a corresponding rmap entry and add it to the list for the L1 guest memslot at the index of the L1 guest page it maps. This means at the L1 guest memslot we end up with lists of rmaps. When we are notified of a host page being invalidated which has been mapped through to a (L1) guest, we can then walk the rmap list for that guest page, and find and invalidate all of the corresponding shadow_pgtable entries. In order to reduce memory consumption, we compress the information for each rmap entry down to 52 bits -- 12 bits for the LPID and 40 bits for the guest real page frame number -- which will fit in a single unsigned long. To avoid a scenario where a guest can trigger unbounded memory allocations, we scan the list when adding an entry to see if there is already an entry with the contents we need. This can occur, because we don't ever remove entries from the middle of a list. A struct nested guest rmap is a list pointer and an rmap entry; ---------------- | next pointer | ---------------- | rmap entry | ---------------- Thus the rmap pointer for each guest frame number in the memslot can be either NULL, a single entry, or a pointer to a list of nested rmap entries. gfn memslot rmap array ------------------------- 0 | NULL | (no rmap entry) ------------------------- 1 | single rmap entry | (rmap entry with low bit set) ------------------------- 2 | list head pointer | (list of rmap entries) ------------------------- The final entry always has the lowest bit set and is stored in the next pointer of the last list entry, or as a single rmap entry. With a list of rmap entries looking like; ----------------- ----------------- ------------------------- | list head ptr | ----> | next pointer | ----> | single rmap entry | ----------------- ----------------- ------------------------- | rmap entry | | rmap entry | ----------------- ------------------------- Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
4bad7779 |
|
07-Oct-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Handle hypercalls correctly when nested When we are running as a nested hypervisor, we use a hypercall to enter the guest rather than code in book3s_hv_rmhandlers.S. This means that the hypercall handlers listed in hcall_real_table never get called. There are some hypercalls that are handled there and not in kvmppc_pseries_do_hcall(), which therefore won't get processed for a nested guest. To fix this, we add cases to kvmppc_pseries_do_hcall() to handle those hypercalls, with the following exceptions: - The HPT hypercalls (H_ENTER, H_REMOVE, etc.) are not handled because we only support radix mode for nested guests. - H_CEDE has to be handled specially because the cede logic in kvmhv_run_single_vcpu assumes that it has been processed by the time that kvmhv_p9_guest_entry() returns. Therefore we put a special case for H_CEDE in kvmhv_p9_guest_entry(). For the XICS hypercalls, if real-mode processing is enabled, then the virtual-mode handlers assume that they are being called only to finish up the operation. Therefore we turn off the real-mode flag in the XICS code when running as a nested hypervisor. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
f3c18e93 |
|
07-Oct-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Use XICS hypercalls when running as a nested hypervisor This adds code to call the H_IPI and H_EOI hypercalls when we are running as a nested hypervisor (i.e. without the CPU_FTR_HVMODE cpu feature) and we would otherwise access the XICS interrupt controller directly or via an OPAL call. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
360cae31 |
|
07-Oct-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Nested guest entry via hypercall This adds a new hypercall, H_ENTER_NESTED, which is used by a nested hypervisor to enter one of its nested guests. The hypercall supplies register values in two structs. Those values are copied by the level 0 (L0) hypervisor (the one which is running in hypervisor mode) into the vcpu struct of the L1 guest, and then the guest is run until an interrupt or error occurs which needs to be reported to L1 via the hypercall return value. Currently this assumes that the L0 and L1 hypervisors are the same endianness, and the structs passed as arguments are in native endianness. If they are of different endianness, the version number check will fail and the hcall will be rejected. Nested hypervisors do not support indep_threads_mode=N, so this adds code to print a warning message if the administrator has set indep_threads_mode=N, and treat it as Y. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
8e3f5fc1 |
|
07-Oct-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization This starts the process of adding the code to support nested HV-style virtualization. It defines a new H_SET_PARTITION_TABLE hypercall which a nested hypervisor can use to set the base address and size of a partition table in its memory (analogous to the PTCR register). On the host (level 0 hypervisor) side, the H_SET_PARTITION_TABLE hypercall from the guest is handled by code that saves the virtual PTCR value for the guest. This also adds code for creating and destroying nested guests and for reading the partition table entry for a nested guest from L1 memory. Each nested guest has its own shadow LPID value, different in general from the LPID value used by the nested hypervisor to refer to it. The shadow LPID value is allocated at nested guest creation time. Nested hypervisor functionality is only available for a radix guest, which therefore means a radix host on a POWER9 (or later) processor. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
89329c0b |
|
07-Oct-2018 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Clear partition table entry on vm teardown When destroying a VM we return the LPID to the pool, however we never zero the partition table entry. This is instead done when we reallocate the LPID. Zero the partition table entry on VM teardown before returning the LPID to the pool. This means if we were running as a nested hypervisor the real hypervisor could use this to determine when it can free resources. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
fd0944ba |
|
07-Oct-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Use ccr field in pt_regs struct embedded in vcpu struct When the 'regs' field was added to struct kvm_vcpu_arch, the code was changed to use several of the fields inside regs (e.g., gpr, lr, etc.) but not the ccr field, because the ccr field in struct pt_regs is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is only 32 bits. This changes the code to use the regs.ccr field instead of cr, and changes the assembly code on 64-bit platforms to use 64-bit loads and stores instead of 32-bit ones. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
9a94d3ee |
|
07-Oct-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Add a debugfs file to dump radix mappings This adds a file called 'radix' in the debugfs directory for the guest, which when read gives all of the valid leaf PTEs in the partition-scoped radix tree for a radix guest, in human-readable format. It is analogous to the existing 'htab' file which dumps the HPT entries for a HPT guest. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
32eb150a |
|
07-Oct-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Handle hypervisor instruction faults better Currently the code for handling hypervisor instruction page faults passes 0 for the flags indicating the type of fault, which is OK in the usual case that the page is not mapped in the partition-scoped page tables. However, there are other causes for hypervisor instruction page faults, such as not being to update a reference (R) or change (C) bit. The cause is indicated in bits in HSRR1, including a bit which indicates that the fault is due to not being able to write to a page (for example to update an R or C bit). Not handling these other kinds of faults correctly can lead to a loop of continual faults without forward progress in the guest. In order to handle these faults better, this patch constructs a "DSISR-like" value from the bits which DSISR and SRR1 (for a HISI) have in common, and passes it to kvmppc_book3s_hv_page_fault() so that it knows what caused the fault. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
95a6432c |
|
07-Oct-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests This creates an alternative guest entry/exit path which is used for radix guests on POWER9 systems when we have indep_threads_mode=Y. In these circumstances there is exactly one vcpu per vcore and there is no coordination required between vcpus or vcores; the vcpu can enter the guest without needing to synchronize with anything else. The new fast path is implemented almost entirely in C in book3s_hv.c and runs with the MMU on until the guest is entered. On guest exit we use the existing path until the point where we are committed to exiting the guest (as distinct from handling an interrupt in the low-level code and returning to the guest) and we have pulled the guest context from the XIVE. At that point we check a flag in the stack frame to see whether we came in via the old path and the new path; if we came in via the new path then we go back to C code to do the rest of the process of saving the guest context and restoring the host context. The C code is split into separate functions for handling the OS-accessible state and the hypervisor state, with the idea that the latter can be replaced by a hypercall when we implement nested virtualization. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [mpe: Fix CONFIG_ALTIVEC=n build] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
53655ddd |
|
07-Oct-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Call kvmppc_handle_exit_hv() with vcore unlocked Currently kvmppc_handle_exit_hv() is called with the vcore lock held because it is called within a for_each_runnable_thread loop. However, we already unlock the vcore within kvmppc_handle_exit_hv() under certain circumstances, and this is safe because (a) any vcpus that become runnable and are added to the runnable set by kvmppc_run_vcpu() have their vcpu->arch.trap == 0 and can't actually run in the guest (because the vcore state is VCORE_EXITING), and (b) for_each_runnable_thread is safe against addition or removal of vcpus from the runnable set. Therefore, in order to simplify things for following patches, let's drop the vcore lock in the for_each_runnable_thread loop, so kvmppc_handle_exit_hv() gets called without the vcore lock held. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
f7035ce9 |
|
07-Oct-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Move interrupt delivery on guest entry to C code This is based on a patch by Suraj Jitindar Singh. This moves the code in book3s_hv_rmhandlers.S that generates an external, decrementer or privileged doorbell interrupt just before entering the guest to C code in book3s_hv_builtin.c. This is to make future maintenance and modification easier. The algorithm expressed in the C code is almost identical to the previous algorithm. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
aa227864 |
|
11-Sep-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Provide mode where all vCPUs on a core must be the same VM This adds a mode where the vcore scheduling logic in HV KVM limits itself to scheduling only virtual cores from the same VM on any given physical core. This is enabled via a new module parameter on the kvm-hv module called "one_vm_per_core". For this to work on POWER9, it is necessary to set indep_threads_mode=N. (On POWER8, hardware limitations mean that KVM is never in independent threads mode, regardless of the indep_threads_mode setting.) Thus the settings needed for this to work are: 1. The host is in SMT1 mode. 2. On POWER8, the host is not in 2-way or 4-way static split-core mode. 3. On POWER9, the indep_threads_mode parameter is N. 4. The one_vm_per_core parameter is Y. With these settings, KVM can run up to 4 vcpus on a core at the same time on POWER9, or up to 8 vcpus on POWER8 (depending on the guest threading mode), and will ensure that all of the vcpus belong to the same VM. This is intended for use in security-conscious settings where users are concerned about possible side-channel attacks between threads which could perhaps enable one VM to attack another VM on the same core, or the host. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
d6ee76d3 |
|
16-Aug-2018 |
Luke Dashjr <luke@dashjr.org> |
powerpc64/ftrace: Include ftrace.h needed for enable/disable calls this_cpu_disable_ftrace and this_cpu_enable_ftrace are inlines in ftrace.h Without it included, the build fails. Fixes: a4bc64d305af ("powerpc64/ftrace: Disable ftrace during kvm entry/exit") Cc: stable@vger.kernel.org # v4.18+ Signed-off-by: Luke Dashjr <luke-jr+git@utopios.org> Acked-by: Naveen N. Rao <naveen.n.rao at linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
45ef5992 |
|
05-Jul-2018 |
Christophe Leroy <christophe.leroy@c-s.fr> |
powerpc: remove unnecessary inclusion of asm/tlbflush.h asm/tlbflush.h is only needed for: - using functions xxx_flush_tlb_xxx() - using MMU_NO_CONTEXT - including asm-generic/pgtable.h Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
b5c6f760 |
|
25-Jul-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Read kvm->arch.emul_smt_mode under kvm->lock Commit 1e175d2 ("KVM: PPC: Book3S HV: Pack VCORE IDs to access full VCPU ID space", 2018-07-25) added code that uses kvm->arch.emul_smt_mode before any VCPUs are created. However, userspace can change kvm->arch.emul_smt_mode at any time up until the first VCPU is created. Hence it is (theoretically) possible for the check in kvmppc_core_vcpu_create_hv() to race with another userspace thread changing kvm->arch.emul_smt_mode. This fixes it by moving the test that uses kvm->arch.emul_smt_mode into the block where kvm->lock is held. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
1e175d2e |
|
25-Jul-2018 |
Sam Bobroff <sam.bobroff@au1.ibm.com> |
KVM: PPC: Book3S HV: Pack VCORE IDs to access full VCPU ID space It is not currently possible to create the full number of possible VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer threads per core than its core stride (or "VSMT mode"). This is because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS even though the VCPU ID is less than KVM_MAX_VCPU_ID. To address this, "pack" the VCORE ID and XIVE offsets by using knowledge of the way the VCPU IDs will be used when there are fewer guest threads per core than the core stride. The primary thread of each core will always be used first. Then, if the guest uses more than one thread per core, these secondary threads will sequentially follow the primary in each core. So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the VCPUs are being spaced apart, so at least half of each core is empty, and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped into the second half of each core (4..7, in an 8-thread core). Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of each core is being left empty, and we can map down into the second and third quarters of each core (2, 3 and 5, 6 in an 8-thread core). Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary threads are being used and 7/8 of the core is empty, allowing use of the 1, 5, 3 and 7 thread slots. (Strides less than 8 are handled similarly.) This allows the VCORE ID or offset to be calculated quickly from the VCPU ID or XIVE server numbers, without access to the VCPU structure. [paulus@ozlabs.org - tidied up comment a little, changed some WARN_ONCE to pr_devel, wrapped line, fixed id check.] Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
0abb75b7 |
|
07-Jul-2018 |
Nicholas Mc Guire <hofrat@osadl.org> |
KVM: PPC: Book3S HV: Fix constant size warning The constants are 64bit but not explicitly declared UL resulting in sparse warnings. Fix this by declaring the constants UL. Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
51eaa08f |
|
07-Jul-2018 |
Nicholas Mc Guire <hofrat@osadl.org> |
KVM: PPC: Book3S HV: Add of_node_put() in success path The call to of_find_compatible_node() is returning a pointer with incremented refcount so it must be explicitly decremented after the last use. As here it is only being used for checking of node presence but the result is not actually used in the success path it can be dropped immediately. Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Fixes: commit f725758b899f ("KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
2bf1071a |
|
05-Jul-2018 |
Nicholas Piggin <npiggin@gmail.com> |
powerpc/64s: Remove POWER9 DD1 support POWER9 DD1 was never a product. It is no longer supported by upstream firmware, and it is not effectively supported in Linux due to lack of testing. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> [mpe: Remove arch_make_huge_pte() entirely] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
b3dae109 |
|
12-Jun-2018 |
Peter Zijlstra <peterz@infradead.org> |
sched/swait: Rename to exclusive Since swait basically implemented exclusive waits only, make sure the API reflects that. $ git grep -l -e "\<swake_up\>" -e "\<swait_event[^ (]*" -e "\<prepare_to_swait\>" | while read file; do sed -i -e 's/\<swake_up\>/&_one/g' -e 's/\<swait_event[^ (]*/&_exclusive/g' -e 's/\<prepare_to_swait\>/&_exclusive/g' $file; done With a few manual touch-ups. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: bigeasy@linutronix.de Cc: oleg@redhat.com Cc: paulmck@linux.vnet.ibm.com Cc: pbonzini@redhat.com Link: https://lkml.kernel.org/r/20180612083909.261946548@infradead.org
|
#
fad953ce |
|
12-Jun-2018 |
Kees Cook <keescook@chromium.org> |
treewide: Use array_size() in vzalloc() The vzalloc() function has no 2-factor argument form, so multiplication factors need to be wrapped in array_size(). This patch replaces cases of: vzalloc(a * b) with: vzalloc(array_size(a, b)) as well as handling cases of: vzalloc(a * b * c) with: vzalloc(array3_size(a, b, c)) This does, however, attempt to ignore constant size factors like: vzalloc(4 * 1024) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( vzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | vzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( vzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | vzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | vzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | vzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | vzalloc( - sizeof(u8) * COUNT + COUNT , ...) | vzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | vzalloc( - sizeof(char) * COUNT + COUNT , ...) | vzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( vzalloc( - sizeof(TYPE) * (COUNT_ID) + array_size(COUNT_ID, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * COUNT_ID + array_size(COUNT_ID, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * (COUNT_CONST) + array_size(COUNT_CONST, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * COUNT_CONST + array_size(COUNT_CONST, sizeof(TYPE)) , ...) | vzalloc( - sizeof(THING) * (COUNT_ID) + array_size(COUNT_ID, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * COUNT_ID + array_size(COUNT_ID, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * (COUNT_CONST) + array_size(COUNT_CONST, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * COUNT_CONST + array_size(COUNT_CONST, sizeof(THING)) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ vzalloc( - SIZE * COUNT + array_size(COUNT, SIZE) , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( vzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | vzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( vzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | vzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | vzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | vzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | vzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | vzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( vzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( vzalloc(C1 * C2 * C3, ...) | vzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants. @@ expression E1, E2; constant C1, C2; @@ ( vzalloc(C1 * C2, ...) | vzalloc( - E1 * E2 + array_size(E1, E2) , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
|
#
929f45e3 |
|
29-May-2018 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
kvm: no need to check return value of debugfs_create functions When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. This cleans up the error handling a lot, as this code will never get hit. Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Christoffer Dall <christoffer.dall@arm.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim KrÄmář" <rkrcmar@redhat.com> Cc: Arvind Yadav <arvind.yadav.cs@gmail.com> Cc: Eric Auger <eric.auger@redhat.com> Cc: Andre Przywara <andre.przywara@arm.com> Cc: kvm-ppc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: kvmarm@lists.cs.columbia.edu Cc: kvm@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
9a4506e1 |
|
17-May-2018 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on The radix guest code can has fewer restrictions about what context it can run in, so move this flushing out of assembly and have it use the Linux TLB flush implementations introduced previously. This allows powerpc:tlbie trace events to be used. This changes the tlbiel sequence to only execute RIC=2 flush once on the first set flushed, then RIC=0 for the rest of the sets. The end result of the flush should be unchanged. This matches the local PID flush pattern that was introduced in a5998fcb92 ("powerpc/mm/radix: Optimise tlbiel flush all case"). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
173c520a |
|
07-May-2018 |
Simon Guo <wei.guo.simon@gmail.com> |
KVM: PPC: Move nip/ctr/lr/xer registers to pt_regs in kvm_vcpu_arch This patch moves nip/ctr/lr/xer registers from scattered places in kvm_vcpu_arch to pt_regs structure. cr register is "unsigned long" in pt_regs and u32 in vcpu->arch. It will need more consideration and may move in later patches. Signed-off-by: Simon Guo <wei.guo.simon@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
7aa15842 |
|
20-Apr-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Set RWMR on POWER8 so PURR/SPURR count correctly Although Linux doesn't use PURR and SPURR ((Scaled) Processor Utilization of Resources Register), other OSes depend on them. On POWER8 they count at a rate depending on whether the VCPU is idle or running, the activity of the VCPU, and the value in the RWMR (Region-Weighting Mode Register). Hardware expects the hypervisor to update the RWMR when a core is dispatched to reflect the number of online VCPUs in the vcore. This adds code to maintain a count in the vcore struct indicating how many VCPUs are online. In kvmppc_run_core we use that count to set the RWMR register on POWER8. If the core is split because of a static or dynamic micro-threading mode, we use the value for 8 threads. The RWMR value is not relevant when the host is executing because Linux does not use the PURR or SPURR register, so we don't bother saving and restoring the host value. For the sake of old userspace which does not set the KVM_REG_PPC_ONLINE register, we set online to 1 if it was 0 at the time of a KVM_RUN ioctl. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
a1f15826 |
|
19-Apr-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Add 'online' register to ONE_REG interface This adds a new KVM_REG_PPC_ONLINE register which userspace can set to 0 or 1 via the GET/SET_ONE_REG interface to indicate whether it considers the VCPU to be offline (0), that is, not currently running, or online (1). This will be used in a later patch to configure the register which controls PURR and SPURR accumulation. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
57b8daa7 |
|
20-Apr-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry Currently, the HV KVM guest entry/exit code adds the timebase offset from the vcore struct to the timebase on guest entry, and subtracts it on guest exit. Which is fine, except that it is possible for userspace to change the offset using the SET_ONE_REG interface while the vcore is running, as there is only one timebase offset per vcore but potentially multiple VCPUs in the vcore. If that were to happen, KVM would subtract a different offset on guest exit from that which it had added on guest entry, leading to the timebase being out of sync between cores in the host, which then leads to bad things happening such as hangs and spurious watchdog timeouts. To fix this, we add a new field 'tb_offset_applied' to the vcore struct which stores the offset that is currently applied to the timebase. This value is set from the vcore tb_offset field on guest entry, and is what is subtracted from the timebase on guest exit. Since it is zero when the timebase offset is not applied, we can simplify the logic in kvmhv_start_timing and kvmhv_accumulate_time. In addition, we had secondary threads reading the timebase while running concurrently with code on the primary thread which would eventually add or subtract the timebase offset from the timebase. This occurred while saving or restoring the DEC register value on the secondary threads. Although no specific incorrect behaviour has been observed, this is a race which should be fixed. To fix it, we move the DEC saving code to just before we call kvmhv_commence_exit, and the DEC restoring code to after the point where we have waited for the primary thread to switch the MMU context and add the timebase offset. That way we are sure that the timebase contains the guest timebase value in both cases. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
a4bc64d3 |
|
18-Apr-2018 |
Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> |
powerpc64/ftrace: Disable ftrace during kvm entry/exit During guest entry/exit, we switch over to/from the guest MMU context and we cannot take exceptions in the hypervisor code. Since ftrace may be enabled and since it can result in us taking a trap, disable ftrace by setting paca->ftrace_enabled to zero. There are two paths through which we enter/exit a guest: 1. If we are the vcore runner, then we enter the guest via __kvmppc_vcore_entry() and we disable ftrace around this. This is always the case for Power9, and for the primary thread on Power8. 2. If we are a secondary thread in Power8, then we would be in nap due to SMT being disabled. We are woken up by an IPI to enter the guest. In this scenario, we enter the guest through kvm_start_guest(). We disable ftrace at this point. In this scenario, ftrace would only get re-enabled on the secondary thread when SMT is re-enabled (via start_secondary()). Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
e303c087 |
|
31-Mar-2018 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Fix ppc_breakpoint_available compile error arch/powerpc/kvm/book3s_hv.c: In function ‘kvmppc_h_set_mode’: arch/powerpc/kvm/book3s_hv.c:745:8: error: implicit declaration of function ‘ppc_breakpoint_available’ if (!ppc_breakpoint_available()) ^~~~~~~~~~~~~~~~~~~~~~~~ Fixes: 398e712c007f ("KVM: PPC: Book3S HV: Return error from h_set_mode(SET_DAWR) on POWER9") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
499dcd41 |
|
13-Feb-2018 |
Nicholas Piggin <npiggin@gmail.com> |
powerpc/64s: Allocate LPPACAs individually We no longer allocate lppacas in an array, so this patch removes the 1kB static alignment for the structure, and enforces the PAPR alignment requirements at allocation time. We can not reduce the 1kB allocation size however, due to existing KVM hypervisors. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
d2e60075 |
|
13-Feb-2018 |
Nicholas Piggin <npiggin@gmail.com> |
powerpc/64: Use array of paca pointers and allocate pacas individually Change the paca array into an array of pointers to pacas. Allocate pacas individually. This allows flexibility in where the PACAs are allocated. Future work will allocate them node-local. Platforms that don't have address limits on PACAs would be able to defer PACA allocations until later in boot rather than allocate all possible ones up-front then freeing unused. This is slightly more overhead (one additional indirection) for cross CPU paca references, but those aren't too common. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
398e712c |
|
26-Mar-2018 |
Michael Neuling <mikey@neuling.org> |
KVM: PPC: Book3S HV: Return error from h_set_mode(SET_DAWR) on POWER9 Return H_P2 on a h_set_mode(SET_DAWR) on POWER9 where the DAWR is disabled. Current Linux guests ignore this error, so they will silently not get the DAWR (sigh). The same error code is being used by POWERVM in this case. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
4bb3c7a0 |
|
21-Mar-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9 POWER9 has hardware bugs relating to transactional memory and thread reconfiguration (changes to hardware SMT mode). Specifically, the core does not have enough storage to store a complete checkpoint of all the architected state for all four threads. The DD2.2 version of POWER9 includes hardware modifications designed to allow hypervisor software to implement workarounds for these problems. This patch implements those workarounds in KVM code so that KVM guests see a full, working transactional memory implementation. The problems center around the use of TM suspended state, where the CPU has a checkpointed state but execution is not transactional. The workaround is to implement a "fake suspend" state, which looks to the guest like suspended state but the CPU does not store a checkpoint. In this state, any instruction that would cause a transition to transactional state (rfid, rfebb, mtmsrd, tresume) or would use the checkpointed state (treclaim) causes a "soft patch" interrupt (vector 0x1500) to the hypervisor so that it can be emulated. The trechkpt instruction also causes a soft patch interrupt. On POWER9 DD2.2, we avoid returning to the guest in any state which would require a checkpoint to be present. The trechkpt in the guest entry path which would normally create that checkpoint is replaced by either a transition to fake suspend state, if the guest is in suspend state, or a rollback to the pre-transactional state if the guest is in transactional state. Fake suspend state is indicated by a flag in the PACA plus a new bit in the PSSCR. The new PSSCR bit is write-only and reads back as 0. On exit from the guest, if the guest is in fake suspend state, we still do the treclaim instruction as we would in real suspend state, in order to get into non-transactional state, but we do not save the resulting register state since there was no checkpoint. Emulation of the instructions that cause a softpatch interrupt is handled in two paths. If the guest is in real suspend mode, we call kvmhv_p9_tm_emulation_early() to handle the cases where the guest is transitioning to transactional state. This is called before we do the treclaim in the guest exit path; because we haven't done treclaim, we can get back to the guest with the transaction still active. If the instruction is a case that kvmhv_p9_tm_emulation_early() doesn't handle, or if the guest is in fake suspend state, then we proceed to do the complete guest exit path and subsequently call kvmhv_p9_tm_emulation() in host context with the MMU on. This handles all the cases including the cases that generate program interrupts (illegal instruction or TM Bad Thing) and facility unavailable interrupts. The emulation is reasonably straightforward and is mostly concerned with checking for exception conditions and updating the state of registers such as MSR and CR0. The treclaim emulation takes care to ensure that the TEXASR register gets updated as if it were the guest treclaim instruction that had done failure recording, not the treclaim done in hypervisor state in the guest exit path. With this, the KVM_CAP_PPC_HTM capability returns true (1) even if transactional memory is not available to host userspace. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
39c983ea |
|
21-Feb-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Remove unused kvm_unmap_hva callback Since commit fb1522e099f0 ("KVM: update to new mmu_notifier semantic v2", 2017-08-31), the MMU notifier code in KVM no longer calls the kvm_unmap_hva callback. This removes the PPC implementations of kvm_unmap_hva(). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
61bd0f66 |
|
02-Mar-2018 |
Laurent Vivier <lvivier@redhat.com> |
KVM: PPC: Book3S HV: Fix guest time accounting with VIRT_CPU_ACCOUNTING_GEN Since commit 8b24e69fc47e ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry"), if CONFIG_VIRT_CPU_ACCOUNTING_GEN is set, the guest time is not accounted to guest time and user time, but instead to system time. This is because guest_enter()/guest_exit() are called while interrupts are disabled and the tick counter cannot be updated between them. To fix that, move guest_exit() after local_irq_enable(), and as guest_enter() is called with IRQ disabled, call guest_enter_irqoff() instead. Fixes: 8b24e69fc47e ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry") Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
debd574f |
|
01-Mar-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Fix VRMA initialization with 2MB or 1GB memory backing The current code for initializing the VRMA (virtual real memory area) for HPT guests requires the page size of the backing memory to be one of 4kB, 64kB or 16MB. With a radix host we have the possibility that the backing memory page size can be 2MB or 1GB. In these cases, if the guest switches to HPT mode, KVM will not initialize the VRMA and the guest will fail to run. In fact it is not necessary that the VRMA page size is the same as the backing memory page size; any VRMA page size less than or equal to the backing memory page size is acceptable. Therefore we now choose the largest page size out of the set {4k, 64k, 16M} which is not larger than the backing memory page size. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
36ee41d1 |
|
29-Jan-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Drop locks before reading guest memory Running with CONFIG_DEBUG_ATOMIC_SLEEP reveals that HV KVM tries to read guest memory, in order to emulate guest instructions, while preempt is disabled and a vcore lock is held. This occurs in kvmppc_handle_exit_hv(), called from post_guest_process(), when emulating guest doorbell instructions on POWER9 systems, and also when checking whether we have hit a hypervisor breakpoint. Reading guest memory can cause a page fault and thus cause the task to sleep, so we need to avoid reading guest memory while holding a spinlock or when preempt is disabled. To fix this, we move the preempt_enable() in kvmppc_run_core() to before the loop that calls post_guest_process() for each vcore that has just run, and we drop and re-take the vcore lock around the calls to kvmppc_emulate_debug_inst() and kvmppc_emulate_doorbell_instr(). Dropping the lock is safe with respect to the iteration over the runnable vcpus in post_guest_process(); for_each_runnable_thread is actually safe to use locklessly. It is possible for a vcpu to become runnable and add itself to the runnable_threads array (code near the beginning of kvmppc_run_vcpu()) and then get included in the iteration in post_guest_process despite the fact that it has not just run. This is benign because vcpu->arch.trap and vcpu->arch.ceded will be zero. Cc: stable@vger.kernel.org # v4.13+ Fixes: 579006944e0d ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
57ad583f |
|
11-Jan-2017 |
Russell Currey <ruscur@russell.cc> |
powerpc: Use octal numbers for file permissions Symbolic macros are unintuitive and hard to read, whereas octal constants are much easier to interpret. Replace macros for the basic permission flags (user/group/other read/write/execute) with numeric constants instead, across the whole powerpc tree. Introducing a significant number of changes across the tree for no runtime benefit isn't exactly desirable, but so long as these macros are still used in the tree people will keep sending patches that add them. Not only are they hard to parse at a glance, there are multiple ways of coming to the same value (as you can see with 0444 and 0644 in this patch) which hurts readability. Signed-off-by: Russell Currey <ruscur@russell.cc> Reviewed-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
2267ea76 |
|
11-Jan-2018 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
KVM: PPC: Book3S HV: Don't use existing "prodded" flag for XIVE escalations The prodded flag is only cleared at the beginning of H_CEDE, so every time we have an escalation, we will cause the *next* H_CEDE to return immediately. Instead use a dedicated "irq_pending" flag to indicate that a guest interrupt is pending for the VCPU. We don't reuse the existing exception bitmap so as to avoid expensive atomic ops. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
00608e1f |
|
10-Jan-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Allow HPT and radix on the same core for POWER9 v2.2 POWER9 chip versions starting with "Nimbus" v2.2 can support running with some threads of a core in HPT mode and others in radix mode. This means that we don't have to prohibit independent-threads mode when running a HPT guest on a radix host, and we don't have to do any of the synchronization between threads that was introduced in commit c01015091a77 ("KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix hosts", 2017-10-19). Rather than using up another CPU feature bit, we just do an explicit test on the PVR (processor version register) at module startup time to determine whether we have to take steps to avoid having some threads in HPT mode and some in radix mode (so-called "mixed mode"). We test for "Nimbus" (indicated by 0 or 1 in the top nibble of the lower 16 bits) v2.2 or later, or "Cumulus" (indicated by 2 or 3 in that nibble) v1.1 or later. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
5855564c |
|
12-Jan-2018 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Enable migration of decrementer register This adds a register identifier for use with the one_reg interface to allow the decrementer expiry time to be read and written by userspace. The decrementer expiry time is in guest timebase units and is equal to the sum of the decrementer and the guest timebase. (The expiry time is used rather than the decrementer value itself because the expiry time is not constantly changing, though the decrementer value is, while the guest vcpu is not running.) Without this, a guest vcpu migrated to a new host will see its decrementer set to some random value. On POWER8 and earlier, the decrementer is 32 bits wide and counts down at 512MHz, so the guest vcpu will potentially see no decrementer interrupts for up to about 4 seconds, which will lead to a stall. With POWER9, the decrementer is now 56 bits side, so the stall can be much longer (up to 2.23 years) and more noticeable. To help work around the problem in cases where userspace has not been updated to migrate the decrementer expiry time, we now set the default decrementer expiry at vcpu creation time to the current time rather than the maximum possible value. This should mean an immediate decrementer interrupt when a migrated vcpu starts running. In cases where the decrementer is 32 bits wide and more than 4 seconds elapse between the creation of the vcpu and when it first runs, the decrementer would have wrapped around to positive values and there may still be a stall - but this is no worse than the current situation. In the large-decrementer case, we are sure to get an immediate decrementer interrupt (assuming the time from vcpu creation to first run is less than 2.23 years) and we thus avoid a very long stall. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
c0093f1a |
|
19-Nov-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Fix conditions for starting vcpu This corrects the test that determines whether a vcpu that has just become able to run in the guest (e.g. it has just finished handling a hypercall or hypervisor page fault) and whose virtual core is already running somewhere as a "piggybacked" vcore can start immediately or not. (A piggybacked vcore is one which is executing along with another vcore as a result of dynamic micro-threading.) Previously the test tried to lock the piggybacked vcore using spin_trylock, which would always fail because the vcore was already locked, and so the vcpu would have to wait until its vcore exited the guest before it could enter. In fact the vcpu can enter if its vcore is in VCORE_PIGGYBACK state and not already exiting (or exited) the guest, so the test in VCORE_PIGGYBACK state is basically the same as for VCORE_RUNNING state. Coverity detected this as a double unlock issue, which it isn't because the spin_trylock would always fail. This will fix the apparent double unlock as well. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
4fcf361d |
|
19-Nov-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Remove useless statement This removes a statement that has no effect. It should have been removed in commit 898b25b202f3 ("KVM: PPC: Book3S HV: Simplify dynamic micro-threading code", 2017-06-22) along with the loop over the piggy-backed virtual cores. This issue was reported by Coverity. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
ded13fc1 |
|
21-Nov-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Fix migration and HPT resizing of HPT guests on radix hosts This fixes two errors that prevent a guest using the HPT MMU from successfully migrating to a POWER9 host in radix MMU mode, or resizing its HPT when running on a radix host. The first bug was that commit 8dc6cca556e4 ("KVM: PPC: Book3S HV: Don't rely on host's page size information", 2017-09-11) missed two uses of hpte_base_page_size(), one in the HPT rehashing code and one in kvm_htab_write() (which is used on the destination side in migrating a HPT guest). Instead we use kvmppc_hpte_base_page_shift(). Having the shift count means that we can use left and right shifts instead of multiplication and division in a few places. Along the way, this adds a check in kvm_htab_write() to ensure that the page size encoding in the incoming HPTEs is recognized, and if not return an EINVAL error to userspace. The second bug was that kvm_htab_write was performing some but not all of the functions of kvmhv_setup_mmu(), resulting in the destination VM being left in radix mode as far as the hardware is concerned. The simplest fix for now is make kvm_htab_write() call kvmppc_setup_partition_table() like kvmppc_hv_setup_htab_rma() does. In future it would be better to refactor the code more extensively to remove the duplication. Fixes: 8dc6cca556e4 ("KVM: PPC: Book3S HV: Don't rely on host's page size information") Fixes: 7a84084c6054 ("KVM: PPC: Book3S HV: Set partition table rather than SDR1 on POWER9") Reported-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Tested-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
432953b4 |
|
08-Nov-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Cosmetic post-merge cleanups This rearranges the code in kvmppc_run_vcpu() and kvmppc_run_vcpu_hv() to be neater and clearer. Deeply indented code in kvmppc_run_vcpu() is moved out to a helper function, kvmhv_setup_mmu(). In kvmppc_vcpu_run_hv(), make use of the existing variable 'kvm' in place of 'vcpu->kvm'. No functional change. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
38c53af8 |
|
07-Nov-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updates Commit 5e9859699aba ("KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation", 2016-12-20) added code that tries to exclude any use or update of the hashed page table (HPT) while the HPT resizing code is iterating through all the entries in the HPT. It does this by taking the kvm->lock mutex, clearing the kvm->arch.hpte_setup_done flag and then sending an IPI to all CPUs in the host. The idea is that any VCPU task that tries to enter the guest will see that the hpte_setup_done flag is clear and therefore call kvmppc_hv_setup_htab_rma, which also takes the kvm->lock mutex and will therefore block until we release kvm->lock. However, any VCPU that is already in the guest, or is handling a hypervisor page fault or hypercall, can re-enter the guest without rechecking the hpte_setup_done flag. The IPI will cause a guest exit of any VCPUs that are currently in the guest, but does not prevent those VCPU tasks from immediately re-entering the guest. The result is that after resize_hpt_rehash_hpte() has made a HPTE absent, a hypervisor page fault can occur and make that HPTE present again. This includes updating the rmap array for the guest real page, meaning that we now have a pointer in the rmap array which connects with pointers in the old rev array but not the new rev array. In fact, if the HPT is being reduced in size, the pointer in the rmap array could point outside the bounds of the new rev array. If that happens, we can get a host crash later on such as this one: [91652.628516] Unable to handle kernel paging request for data at address 0xd0000000157fb10c [91652.628668] Faulting instruction address: 0xc0000000000e2640 [91652.628736] Oops: Kernel access of bad area, sig: 11 [#1] [91652.628789] LE SMP NR_CPUS=1024 NUMA PowerNV [91652.628847] Modules linked in: binfmt_misc vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas i2c_opal ipmi_powernv ipmi_devintf i2c_core ipmi_msghandler powernv_op_panel nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc kvm_hv kvm_pr kvm scsi_dh_alua dm_service_time dm_multipath tg3 ptp pps_core [last unloaded: stap_552b612747aec2da355051e464fa72a1_14259] [91652.629566] CPU: 136 PID: 41315 Comm: CPU 21/KVM Tainted: G O 4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le #1 [91652.629684] task: c0000007a419e400 task.stack: c0000000028d8000 [91652.629750] NIP: c0000000000e2640 LR: d00000000c36e498 CTR: c0000000000e25f0 [91652.629829] REGS: c0000000028db5d0 TRAP: 0300 Tainted: G O (4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le) [91652.629932] MSR: 900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 44022422 XER: 00000000 [91652.630034] CFAR: d00000000c373f84 DAR: d0000000157fb10c DSISR: 40000000 SOFTE: 1 [91652.630034] GPR00: d00000000c36e498 c0000000028db850 c000000001403900 c0000007b7960000 [91652.630034] GPR04: d0000000117fb100 d000000007ab00d8 000000000033bb10 0000000000000000 [91652.630034] GPR08: fffffffffffffe7f 801001810073bb10 d00000000e440000 d00000000c373f70 [91652.630034] GPR12: c0000000000e25f0 c00000000fdb9400 f000000003b24680 0000000000000000 [91652.630034] GPR16: 00000000000004fb 00007ff7081a0000 00000000000ec91a 000000000033bb10 [91652.630034] GPR20: 0000000000010000 00000000001b1190 0000000000000001 0000000000010000 [91652.630034] GPR24: c0000007b7ab8038 d0000000117fb100 0000000ec91a1190 c000001e6a000000 [91652.630034] GPR28: 00000000033bb100 000000000073bb10 c0000007b7960000 d0000000157fb100 [91652.630735] NIP [c0000000000e2640] kvmppc_add_revmap_chain+0x50/0x120 [91652.630806] LR [d00000000c36e498] kvmppc_book3s_hv_page_fault+0xbb8/0xc40 [kvm_hv] [91652.630884] Call Trace: [91652.630913] [c0000000028db850] [c0000000028db8b0] 0xc0000000028db8b0 (unreliable) [91652.630996] [c0000000028db8b0] [d00000000c36e498] kvmppc_book3s_hv_page_fault+0xbb8/0xc40 [kvm_hv] [91652.631091] [c0000000028db9e0] [d00000000c36a078] kvmppc_vcpu_run_hv+0xdf8/0x1300 [kvm_hv] [91652.631179] [c0000000028dbb30] [d00000000c2248c4] kvmppc_vcpu_run+0x34/0x50 [kvm] [91652.631266] [c0000000028dbb50] [d00000000c220d54] kvm_arch_vcpu_ioctl_run+0x114/0x2a0 [kvm] [91652.631351] [c0000000028dbbd0] [d00000000c2139d8] kvm_vcpu_ioctl+0x598/0x7a0 [kvm] [91652.631433] [c0000000028dbd40] [c0000000003832e0] do_vfs_ioctl+0xd0/0x8c0 [91652.631501] [c0000000028dbde0] [c000000000383ba4] SyS_ioctl+0xd4/0x130 [91652.631569] [c0000000028dbe30] [c00000000000b8e0] system_call+0x58/0x6c [91652.631635] Instruction dump: [91652.631676] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ffa1 2fa70000 793d0020 e9432110 [91652.631814] 7bbf26e4 7c7e1b78 7feafa14 409e0094 <807f000c> 786326e4 7c6a1a14 93a40008 [91652.631959] ---[ end trace ac85ba6db72e5b2e ]--- To fix this, we tighten up the way that the hpte_setup_done flag is checked to ensure that it does provide the guarantee that the resizing code needs. In kvmppc_run_core(), we check the hpte_setup_done flag after disabling interrupts and refuse to enter the guest if it is clear (for a HPT guest). The code that checks hpte_setup_done and calls kvmppc_hv_setup_htab_rma() is moved from kvmppc_vcpu_run_hv() to a point inside the main loop in kvmppc_run_vcpu(), ensuring that we don't just spin endlessly calling kvmppc_run_core() while hpte_setup_done is clear, but instead have a chance to block on the kvm->lock mutex. Finally we also check hpte_setup_done inside the region in kvmppc_book3s_hv_page_fault() where the HPTE is locked and we are about to update the HPTE, and bail out if it is clear. If another CPU is inside kvm_vm_ioctl_resize_hpt_commit) and has cleared hpte_setup_done, then we know that either we are looking at a HPTE that resize_hpt_rehash_hpte() has not yet processed, which is OK, or else we will see hpte_setup_done clear and refuse to update it, because of the full barrier formed by the unlock of the HPTE in resize_hpt_rehash_hpte() combined with the locking of the HPTE in kvmppc_book3s_hv_page_fault(). Fixes: 5e9859699aba ("KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation") Cc: stable@vger.kernel.org # v4.10+ Reported-by: Satheesh Rajendran <satheera@in.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
6de6638b |
|
05-Nov-2017 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Handle host system reset in guest mode If the host takes a system reset interrupt while a guest is running, the CPU must exit the guest before processing the host exception handler. After this patch, taking a sysrq+x with a CPU running in a guest gives a trace like this: cpu 0x27: Vector: 100 (System Reset) at [c000000fdf5776f0] pc: c008000010158b80: kvmppc_run_core+0x16b8/0x1ad0 [kvm_hv] lr: c008000010158b80: kvmppc_run_core+0x16b8/0x1ad0 [kvm_hv] sp: c000000fdf577850 msr: 9000000002803033 current = 0xc000000fdf4b1e00 paca = 0xc00000000fd4d680 softe: 3 irq_happened: 0x01 pid = 6608, comm = qemu-system-ppc Linux version 4.14.0-rc7-01489-g47e1893a404a-dirty #26 SMP [c000000fdf577a00] c008000010159dd4 kvmppc_vcpu_run_hv+0x3dc/0x12d0 [kvm_hv] [c000000fdf577b30] c0080000100a537c kvmppc_vcpu_run+0x44/0x60 [kvm] [c000000fdf577b60] c0080000100a1ae0 kvm_arch_vcpu_ioctl_run+0x118/0x310 [kvm] [c000000fdf577c00] c008000010093e98 kvm_vcpu_ioctl+0x530/0x7c0 [kvm] [c000000fdf577d50] c000000000357bf8 do_vfs_ioctl+0xd8/0x8c0 [c000000fdf577df0] c000000000358448 SyS_ioctl+0x68/0x100 [c000000fdf577e30] c00000000000b220 system_call+0x58/0x6c --- Exception: c01 (System Call) at 00007fff76868df0 SP (7fff7069baf0) is in userspace Fixes: e36d0a2ed5 ("powerpc/powernv: Implement NMI IPI with OPAL_SIGNAL_SYSTEM_RESET") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
c0101509 |
|
18-Oct-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix hosts This patch removes the restriction that a radix host can only run radix guests, allowing us to run HPT (hashed page table) guests as well. This is useful because it provides a way to run old guest kernels that know about POWER8 but not POWER9. Unfortunately, POWER9 currently has a restriction that all threads in a given code must either all be in HPT mode, or all in radix mode. This means that when entering a HPT guest, we have to obtain control of all 4 threads in the core and get them to switch their LPIDR and LPCR registers, even if they are not going to run a guest. On guest exit we also have to get all threads to switch LPIDR and LPCR back to host values. To make this feasible, we require that KVM not be in the "independent threads" mode, and that the CPU cores be in single-threaded mode from the host kernel's perspective (only thread 0 online; threads 1, 2 and 3 offline). That allows us to use the same code as on POWER8 for obtaining control of the secondary threads. To manage the LPCR/LPIDR changes required, we extend the kvm_split_info struct to contain the information needed by the secondary threads. All threads perform a barrier synchronization (where all threads wait for every other thread to reach the synchronization point) on guest entry, both before and after loading LPCR and LPIDR. On guest exit, they all once again perform a barrier synchronization both before and after loading host values into LPCR and LPIDR. Finally, it is also currently necessary to flush the entire TLB every time we enter a HPT guest on a radix host. We do this on thread 0 with a loop of tlbiel instructions. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
516f7898 |
|
15-Oct-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Allow for running POWER9 host in single-threaded mode This patch allows for a mode on POWER9 hosts where we control all the threads of a core, much as we do on POWER8. The mode is controlled by a module parameter on the kvm_hv module, called "indep_threads_mode". The normal mode on POWER9 is the "independent threads" mode, with indep_threads_mode=Y, where the host is in SMT4 mode (or in fact any desired SMT mode) and each thread independently enters and exits from KVM guests without reference to what other threads in the core are doing. If indep_threads_mode is set to N at the point when a VM is started, KVM will expect every core that the guest runs on to be in single threaded mode (that is, threads 1, 2 and 3 offline), and will set the flag that prevents secondary threads from coming online. We can still use all four threads; the code that implements dynamic micro-threading on POWER8 will become active in over-commit situations and will allow up to three other VCPUs to be run on the secondary threads of the core whenever a VCPU is run. The reason for wanting this mode is that this will allow us to run HPT guests on a radix host on a POWER9 machine that does not support "mixed mode", that is, having some threads in a core be in HPT mode while other threads are in radix mode. It will also make it possible to implement a "strict threads" mode in future, if desired. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
18c3640c |
|
13-Sep-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host This sets up the machinery for switching a guest between HPT (hashed page table) and radix MMU modes, so that in future we can run a HPT guest on a radix host on POWER9 machines. * The KVM_PPC_CONFIGURE_V3_MMU ioctl can now specify either HPT or radix mode, on a radix host. * The KVM_CAP_PPC_MMU_HASH_V3 capability now returns 1 on POWER9 with HV KVM on a radix host. * The KVM_PPC_GET_SMMU_INFO returns information about the HPT MMU on a radix host. * The KVM_PPC_ALLOCATE_HTAB ioctl on a radix host will switch the guest to HPT mode and allocate a HPT. * For simplicity, we now allocate the rmap array for each memslot, even on a radix host, since it will be needed if the guest switches to HPT mode. * Since we cannot yet run a HPT guest on a radix host, the KVM_RUN ioctl will return an EINVAL error in that case. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
e641a317 |
|
25-Oct-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix Currently, the HPT code in HV KVM maintains a dirty bit per guest page in the rmap array, whether or not dirty page tracking has been enabled for the memory slot. In contrast, the radix code maintains a dirty bit per guest page in memslot->dirty_bitmap, and only does so when dirty page tracking has been enabled. This changes the HPT code to maintain the dirty bits in the memslot dirty_bitmap like radix does. This results in slightly less code overall, and will mean that we do not lose the dirty bits when transitioning between HPT and radix mode in future. There is one minor change to behaviour as a result. With HPT, when dirty tracking was enabled for a memslot, we would previously clear all the dirty bits at that point (both in the HPT entries and in the rmap arrays), meaning that a KVM_GET_DIRTY_LOG ioctl immediately following would show no pages as dirty (assuming no vcpus have run in the meantime). With this change, the dirty bits on HPT entries are not cleared at the point where dirty tracking is enabled, so KVM_GET_DIRTY_LOG would show as dirty any guest pages that are resident in the HPT and dirty. This is consistent with what happens on radix. This also fixes a bug in the mark_pages_dirty() function for radix (in the sense that the function no longer exists). In the case where a large page of 64 normal pages or more is marked dirty, the addressing of the dirty bitmap was incorrect and could write past the end of the bitmap. Fortunately this case was never hit in practice because a 2MB large page is only 32 x 64kB pages, and we don't support backing the guest with 1GB huge pages at this point. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
1b151ce4 |
|
12-Sep-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Rename hpte_setup_done to mmu_ready This renames the kvm->arch.hpte_setup_done field to mmu_ready because we will want to use it for radix guests too -- both for setting things up before vcpu execution, and for excluding vcpus from executing while MMU-related things get changed, such as in future switching the MMU from radix to HPT mode or vice-versa. This also moves the call to kvmppc_setup_partition_table() that was done in kvmppc_hv_setup_htab_rma() for HPT guests, and the setting of mmu_ready, into the caller in kvmppc_vcpu_run_hv(). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
8dc6cca5 |
|
10-Sep-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Don't rely on host's page size information This removes the dependence of KVM on the mmu_psize_defs array (which stores information about hardware support for various page sizes) and the things derived from it, chiefly hpte_page_sizes[], hpte_page_size(), hpte_actual_page_size() and get_sllp_encoding(). We also no longer rely on the mmu_slb_size variable or the MMU_FTR_1T_SEGMENTS feature bit. The reason for doing this is so we can support a HPT guest on a radix host. In a radix host, the mmu_psize_defs array contains information about page sizes supported by the MMU in radix mode rather than the page sizes supported by the MMU in HPT mode. Similarly, mmu_slb_size and the MMU_FTR_1T_SEGMENTS bit are not set. Instead we hard-code knowledge of the behaviour of the HPT MMU in the POWER7, POWER8 and POWER9 processors (which are the only processors supported by HV KVM) - specifically the encoding of the LP fields in the HPT and SLB entries, and the fact that they have 32 SLB entries and support 1TB segments. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
31a4d448 |
|
18-Oct-2017 |
Paul Mackerras <paulus@ozlabs.org> |
Revert "KVM: PPC: Book3S HV: POWER9 does not require secondary thread management" This reverts commit 94a04bc25a2c6296bd0c5e82c10e8231c2b11f77. In order to run HPT guests on a radix POWER9 host, we will have to run the host in single-threaded mode, because POWER9 processors do not currently support running some threads of a core in HPT mode while others are in radix mode ("mixed mode"). That means that we will need the same mechanisms that are used on POWER8 to make the secondary threads available to KVM, which were disabled on POWER9 by commit 94a04bc25a2c. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
4bb817ed |
|
03-Sep-2017 |
Thomas Meyer <thomas@m3y3r.de> |
KVM: PPC: Book3S HV: Use ARRAY_SIZE macro Use ARRAY_SIZE macro, rather than explicitly coding some variant of it yourself. Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e 's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\ /\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\) /ARRAY_SIZE(\1)/g' and manual check/verification. Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
267ad7bc |
|
13-Sep-2017 |
Davidlohr Bueso <dave@stgolabs.net> |
kvm,powerpc: Serialize wq active checks in ops->vcpu_kick Particularly because kvmppc_fast_vcpu_kick_hv() is a callback, ensure that we properly serialize wq active checks in order to avoid potentially missing a wakeup due to racing with the waiter side. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
cf5f6f31 |
|
11-Sep-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Hold kvm->lock around call to kvmppc_update_lpcr Commit 468808bd35c4 ("KVM: PPC: Book3S HV: Set process table for HPT guests on POWER9", 2017-01-30) added a call to kvmppc_update_lpcr() which doesn't hold the kvm->lock mutex around the call, as required. This adds the lock/unlock pair, and for good measure, includes the kvmppc_setup_partition_table() call in the locked region, since it is altering global state of the VM. This error appears not to have any fatal consequences for the host; the consequences would be that the VCPUs could end up running with different LPCR values, or an update to the LPCR value by userspace using the one_reg interface could get overwritten, or the update done by kvmhv_configure_mmu() could get overwritten. Cc: stable@vger.kernel.org # v4.10+ Fixes: 468808bd35c4 ("KVM: PPC: Book3S HV: Set process table for HPT guests on POWER9") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
e3bfed1d |
|
25-Aug-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Report storage key support to userspace This adds information about storage keys to the struct returned by the KVM_PPC_GET_SMMU_INFO ioctl. The new fields replace a pad field, which was zeroed by previous kernel versions. Thus userspace that knows about the new fields will see zeroes when running on an older kernel, indicating that storage keys are not supported. The size of the structure has not changed. The number of keys is hard-coded for the CPUs supported by HV KVM, which is just POWER7, POWER8 and POWER9. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
eaac112e |
|
12-Aug-2017 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: Fix H_REGISTER_VPA VPA size validation KVM currently validates the size of the VPA registered by the client against sizeof(struct lppaca), however we align (and therefore size) that struct to 1kB to avoid crossing a 4kB boundary in the client. PAPR calls for sizes >= 640 bytes to be accepted. Hard code this with a comment. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
94a04bc2 |
|
24-Aug-2017 |
Nicholas Piggin <npiggin@gmail.com> |
KVM: PPC: Book3S HV: POWER9 does not require secondary thread management POWER9 CPUs have independent MMU contexts per thread, so KVM does not need to quiesce secondary threads, so the hwthread_req/hwthread_state protocol does not have to be used. So patch it away on POWER9, and patch away the branch from the Linux idle wakeup to kvm_start_guest that is never used. Add a warning and error out of kvmppc_grab_hwthread in case it is ever called on POWER9. This avoids a hwsync in the idle wakeup path on POWER9. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Paul Mackerras <paulus@ozlabs.org> [mpe: Use WARN(...) instead of WARN_ON()/pr_err(...)] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
e4705715 |
|
20-Jul-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Enable TM before accessing TM registers Commit 46a704f8409f ("KVM: PPC: Book3S HV: Preserve userspace HTM state properly", 2017-06-15) added code to read transactional memory (TM) registers but forgot to enable TM before doing so. The result is that if userspace does have live values in the TM registers, a KVM_RUN ioctl will cause a host kernel crash like this: [ 181.328511] Unrecoverable TM Unavailable Exception f60 at d00000001e7d9980 [ 181.328605] Oops: Unrecoverable TM Unavailable Exception, sig: 6 [#1] [ 181.328613] SMP NR_CPUS=2048 [ 181.328613] NUMA [ 181.328618] PowerNV [ 181.328646] Modules linked in: vhost_net vhost tap nfs_layout_nfsv41_files rpcsec_gss_krb5 nfsv4 dns_resolver nfs +fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat +nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables +ip6table_filter ip6_tables iptable_filter bridge stp llc kvm_hv kvm nfsd ses enclosure scsi_transport_sas ghash_generic +auth_rpcgss gf128mul xts sg ctr nfs_acl lockd vmx_crypto shpchp ipmi_powernv i2c_opal grace ipmi_devintf i2c_core +powernv_rng sunrpc ipmi_msghandler ibmpowernv uio_pdrv_genirq uio leds_powernv powernv_op_panel ip_tables xfs sd_mod +lpfc ipr bnx2x libata mdio ptp pps_core scsi_transport_fc libcrc32c dm_mirror dm_region_hash dm_log dm_mod [ 181.329278] CPU: 40 PID: 9926 Comm: CPU 0/KVM Not tainted 4.12.0+ #1 [ 181.329337] task: c000003fc6980000 task.stack: c000003fe4d80000 [ 181.329396] NIP: d00000001e7d9980 LR: d00000001e77381c CTR: d00000001e7d98f0 [ 181.329465] REGS: c000003fe4d837e0 TRAP: 0f60 Not tainted (4.12.0+) [ 181.329523] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> [ 181.329527] CR: 24022448 XER: 00000000 [ 181.329608] CFAR: d00000001e773818 SOFTE: 1 [ 181.329608] GPR00: d00000001e77381c c000003fe4d83a60 d00000001e7ef410 c000003fdcfe0000 [ 181.329608] GPR04: c000003fe4f00000 0000000000000000 0000000000000000 c000003fd7954800 [ 181.329608] GPR08: 0000000000000001 c000003fc6980000 0000000000000000 d00000001e7e2880 [ 181.329608] GPR12: d00000001e7d98f0 c000000007b19000 00000001295220e0 00007fffc0ce2090 [ 181.329608] GPR16: 0000010011886608 00007fff8c89f260 0000000000000001 00007fff8c080028 [ 181.329608] GPR20: 0000000000000000 00000100118500a6 0000010011850000 0000010011850000 [ 181.329608] GPR24: 00007fffc0ce1b48 0000010011850000 00000000d673b901 0000000000000000 [ 181.329608] GPR28: 0000000000000000 c000003fdcfe0000 c000003fdcfe0000 c000003fe4f00000 [ 181.330199] NIP [d00000001e7d9980] kvmppc_vcpu_run_hv+0x90/0x6b0 [kvm_hv] [ 181.330264] LR [d00000001e77381c] kvmppc_vcpu_run+0x2c/0x40 [kvm] [ 181.330322] Call Trace: [ 181.330351] [c000003fe4d83a60] [d00000001e773478] kvmppc_set_one_reg+0x48/0x340 [kvm] (unreliable) [ 181.330437] [c000003fe4d83b30] [d00000001e77381c] kvmppc_vcpu_run+0x2c/0x40 [kvm] [ 181.330513] [c000003fe4d83b50] [d00000001e7700b4] kvm_arch_vcpu_ioctl_run+0x114/0x2a0 [kvm] [ 181.330586] [c000003fe4d83bd0] [d00000001e7642f8] kvm_vcpu_ioctl+0x598/0x7a0 [kvm] [ 181.330658] [c000003fe4d83d40] [c0000000003451b8] do_vfs_ioctl+0xc8/0x8b0 [ 181.330717] [c000003fe4d83de0] [c000000000345a64] SyS_ioctl+0xc4/0x120 [ 181.330776] [c000003fe4d83e30] [c00000000000b004] system_call+0x58/0x6c [ 181.330833] Instruction dump: [ 181.330869] e92d0260 e9290b50 e9290108 792807e3 41820058 e92d0260 e9290b50 e9290108 [ 181.330941] 792ae8a4 794a1f87 408204f4 e92d0260 <7d4022a6> f9490ff0 e92d0260 7d4122a6 [ 181.331013] ---[ end trace 6f6ddeb4bfe92a92 ]--- The fix is just to turn on the TM bit in the MSR before accessing the registers. Cc: stable@vger.kernel.org # v3.14+ Fixes: 46a704f8409f ("KVM: PPC: Book3S HV: Preserve userspace HTM state properly") Reported-by: Jan Stancek <jstancek@redhat.com> Tested-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
8b24e69f |
|
25-Jun-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Close race with testing for signals on guest entry At present, interrupts are hard-disabled fairly late in the guest entry path, in the assembly code. Since we check for pending signals for the vCPU(s) task(s) earlier in the guest entry path, it is possible for a signal to be delivered before we enter the guest but not be noticed until after we exit the guest for some other reason. Similarly, it is possible for the scheduler to request a reschedule while we are in the guest entry path, and we won't notice until after we have run the guest, potentially for a whole timeslice. Furthermore, with a radix guest on POWER9, we can take the interrupt with the MMU on. In this case we end up leaving interrupts hard-disabled after the guest exit, and they are likely to stay hard-disabled until we exit to userspace or context-switch to another process. This was masking the fact that we were also not setting the RI (recoverable interrupt) bit in the MSR, meaning that if we had taken an interrupt, it would have crashed the host kernel with an unrecoverable interrupt message. To close these races, we need to check for signals and reschedule requests after hard-disabling interrupts, and then keep interrupts hard-disabled until we enter the guest. If there is a signal or a reschedule request from another CPU, it will send an IPI, which will cause a guest exit. This puts the interrupt disabling before we call kvmppc_start_thread() for all the secondary threads of this core that are going to run vCPUs. The reason for that is that once we have started the secondary threads there is no easy way to back out without going through at least part of the guest entry path. However, kvmppc_start_thread() includes some code for radix guests which needs to call smp_call_function(), which must be called with interrupts enabled. To solve this problem, this patch moves that code into a separate function that is called earlier. When the guest exit is caused by an external interrupt, a hypervisor doorbell or a hypervisor maintenance interrupt, we now handle these using the replay facility. __kvmppc_vcore_entry() now returns the trap number that caused the exit on this thread, and instead of the assembly code jumping to the handler entry, we return to C code with interrupts still hard-disabled and set the irq_happened flag in the PACA, so that when we do local_irq_enable() the appropriate handler gets called. With all this, we now have the interrupt soft-enable flag clear while we are in the guest. This is useful because code in the real-mode hypercall handlers that checks whether interrupts are enabled will now see that they are disabled, which is correct, since interrupts are hard-disabled in the real-mode code. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
898b25b2 |
|
21-Jun-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Simplify dynamic micro-threading code Since commit b009031f74da ("KVM: PPC: Book3S HV: Take out virtual core piggybacking code", 2016-09-15), we only have at most one vcore per subcore. Previously, the fact that there might be more than one vcore per subcore meant that we had the notion of a "master vcore", which was the vcore that controlled thread 0 of the subcore. We also needed a list per subcore in the core_info struct to record which vcores belonged to each subcore. Now that there can only be one vcore in the subcore, we can replace the list with a simple pointer and get rid of the notion of the master vcore (and in fact treat every vcore as a master vcore). We can also get rid of the subcore_vm[] field in the core_info struct since it is never read. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
e20bbd3d |
|
11-May-2017 |
Aravinda Prasad <aravinda@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Exit guest upon MCE when FWNMI capability is enabled Enhance KVM to cause a guest exit with KVM_EXIT_NMI exit reason upon a machine check exception (MCE) in the guest address space if the KVM_CAP_PPC_FWNMI capability is enabled (instead of delivering a 0x200 interrupt to guest). This enables QEMU to build error log and deliver machine check exception to guest via guest registered machine check handler. This approach simplifies the delivery of machine check exception to guest OS compared to the earlier approach of KVM directly invoking 0x200 guest interrupt vector. This design/approach is based on the feedback for the QEMU patches to handle machine check exception. Details of earlier approach of handling machine check exception in QEMU and related discussions can be found at: https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html Note: This patch now directly invokes machine_check_print_event_info() from kvmppc_handle_exit_hv() to print the event to host console at the time of guest exit before the exception is passed on to the guest. Hence, the host-side handling which was performed earlier via machine_check_fwnmi is removed. The reasons for this approach is (i) it is not possible to distinguish whether the exception occurred in the guest or the host from the pt_regs passed on the machine_check_exception(). Hence machine_check_exception() calls panic, instead of passing on the exception to the guest, if the machine check exception is not recoverable. (ii) the approach introduced in this patch gives opportunity to the host kernel to perform actions in virtual mode before passing on the exception to the guest. This approach does not require complex tweaks to machine_check_fwnmi and friends. Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
ee3308a2 |
|
19-Jun-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Don't sleep if XIVE interrupt pending on POWER9 On a POWER9 system, it is possible for an interrupt to become pending for a VCPU when that VCPU is about to cede (execute a H_CEDE hypercall) and has already disabled interrupts, or in the H_CEDE processing up to the point where the XIVE context is pulled from the hardware. In such a case, the H_CEDE should not sleep, but should return immediately to the guest. However, the conditions tested in kvmppc_vcpu_woken() don't include the condition that a XIVE interrupt is pending, so the VCPU could sleep until the next decrementer interrupt. To fix this, we add a new xive_interrupt_pending() helper which looks in the XIVE context that was pulled from the hardware to see if the priority of any pending interrupt is higher (numerically lower than) the CPU priority. If so then kvmppc_vcpu_woken() will return true. If the XIVE context has never been used, then both the pipr and the cppr fields will be zero and the test will indicate that no interrupt is pending. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
57900694 |
|
16-May-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9 On POWER9, we no longer have the restriction that we had on POWER8 where all threads in a core have to be in the same partition, so the CPU threads are now independent. However, we still want to be able to run guests with a virtual SMT topology, if only to allow migration of guests from POWER8 systems to POWER9. A guest that has a virtual SMT mode greater than 1 will expect to be able to use the doorbell facility; it will expect the msgsndp and msgclrp instructions to work appropriately and to be able to read sensible values from the TIR (thread identification register) and DPDES (directed privileged doorbell exception status) special-purpose registers. However, since each CPU thread is a separate sub-processor in POWER9, these instructions and registers can only be used within a single CPU thread. In order for these instructions to appear to act correctly according to the guest's virtual SMT mode, we have to trap and emulate them. We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR register. The emulation is triggered by the hypervisor facility unavailable interrupt that occurs when the guest uses them. To cause a doorbell interrupt to occur within the guest, we set the DPDES register to 1. If the guest has interrupts enabled, the CPU will generate a doorbell interrupt and clear the DPDES register in hardware. The DPDES hardware register for the guest is saved in the vcpu->arch.vcore->dpdes field. Since this gets written by the guest exit code, other VCPUs wishing to cause a doorbell interrupt don't write that field directly, but instead set a vcpu->arch.doorbell_request flag. This is consumed and set to 0 by the guest entry code, which then sets DPDES to 1. Emulating reads of the DPDES register is somewhat involved, because it requires reading the doorbell pending interrupt status of all of the VCPU threads in the virtual core, and if any of those VCPUs are running, their doorbell status is only up-to-date in the hardware DPDES registers of the CPUs where they are running. In order to get a reasonable approximation of the current doorbell status, we send those CPUs an IPI, causing an exit from the guest which will update the vcpu->arch.vcore->dpdes field. We then use that value in constructing the emulated DPDES register value. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
3c313524 |
|
05-Feb-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Allow userspace to set the desired SMT mode This allows userspace to set the desired virtual SMT (simultaneous multithreading) mode for a VM, that is, the number of VCPUs that get assigned to each virtual core. Previously, the virtual SMT mode was fixed to the number of threads per subcore, and if userspace wanted to have fewer vcpus per vcore, then it would achieve that by using a sparse CPU numbering. This had the disadvantage that the vcpu numbers can get quite large, particularly for SMT1 guests on a POWER8 with 8 threads per core. With this patch, userspace can set its desired virtual SMT mode and then use contiguous vcpu numbering. On POWER8, where the threading mode is "strict", the virtual SMT mode must be less than or equal to the number of threads per subcore. On POWER9, which implements a "loose" threading mode, the virtual SMT mode can be any power of 2 between 1 and 8, even though there is effectively one thread per subcore, since the threads are independent and can all be in different partitions. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
769377f7 |
|
14-Feb-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Context-switch HFSCR between host and guest on POWER9 This adds code to allow us to use a different value for the HFSCR (Hypervisor Facilities Status and Control Register) when running the guest from that which applies in the host. The reason for doing this is to allow us to trap the msgsndp instruction and related operations in future so that they can be virtualized. We also save the value of HFSCR when a hypervisor facility unavailable interrupt occurs, because the high byte of HFSCR indicates which facility the guest attempted to access. We save and restore the host value on guest entry/exit because some bits of it affect host userspace execution. We only do all this on POWER9, not on POWER8, because we are not intending to virtualize any of the facilities controlled by HFSCR on POWER8. In particular, the HFSCR bit that controls execution of msgsndp and related operations does not exist on POWER8. The HFSCR doesn't exist at all on POWER7. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
1da4e2f4 |
|
19-May-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Don't let VCPU sleep if it has a doorbell pending It is possible, through a narrow race condition, for a VCPU to exit the guest with a H_CEDE hypercall while it has a doorbell interrupt pending. In this case, the H_CEDE should return immediately, but in fact it puts the VCPU to sleep until some other interrupt becomes pending or a prod is received (via another VCPU doing H_PROD). This fixes it by checking the DPDES (Directed Privileged Doorbell Exception Status) bit for the thread along with the other interrupt pending bits. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
1bc3fe81 |
|
22-May-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Enable guests to use large decrementer mode on POWER9 This allows userspace (e.g. QEMU) to enable large decrementer mode for the guest when running on a POWER9 host, by setting the LPCR_LD bit in the guest LPCR value. With this, the guest exit code saves 64 bits of the guest DEC value on exit. Other places that use the guest DEC value check the LPCR_LD bit in the guest LPCR value, and if it is set, omit the 32-bit sign extension that would otherwise be done. This doesn't change the DEC emulation used by PR KVM because PR KVM is not supported on POWER9 yet. This is partly based on an earlier patch by Oliver O'Halloran. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
3d3efb68 |
|
05-Jun-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Ignore timebase offset on POWER9 DD1 POWER9 DD1 has an erratum where writing to the TBU40 register, which is used to apply an offset to the timebase, can cause the timebase to lose counts. This results in the timebase on some CPUs getting out of sync with other CPUs, which then results in misbehaviour of the timekeeping code. To work around the problem, we make KVM ignore the timebase offset for all guests on POWER9 DD1 machines. This means that live migration cannot be supported on POWER9 DD1 machines. Cc: stable@vger.kernel.org # v4.10+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
46a704f8 |
|
15-Jun-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Preserve userspace HTM state properly If userspace attempts to call the KVM_RUN ioctl when it has hardware transactional memory (HTM) enabled, the values that it has put in the HTM-related SPRs TFHAR, TFIAR and TEXASR will get overwritten by guest values. To fix this, we detect this condition and save those SPR values in the thread struct, and disable HTM for the task. If userspace goes to access those SPRs or the HTM facility in future, a TM-unavailable interrupt will occur and the handler will reload those SPRs and re-enable HTM. If userspace has started a transaction and suspended it, we would currently lose the transactional state in the guest entry path and would almost certainly get a "TM Bad Thing" interrupt, which would cause the host to crash. To avoid this, we detect this case and return from the KVM_RUN ioctl with an EINVAL error, with the KVM exit reason set to KVM_EXIT_FAIL_ENTRY. Fixes: b005255e12a3 ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08) Cc: stable@vger.kernel.org # v3.14+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
4c3bb4cc |
|
14-Jun-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Restore critical SPRs to host values on guest exit This restores several special-purpose registers (SPRs) to sane values on guest exit that were missed before. TAR and VRSAVE are readable and writable by userspace, and we need to save and restore them to prevent the guest from potentially affecting userspace execution (not that TAR or VRSAVE are used by any known program that run uses the KVM_RUN ioctl). We save/restore these in kvmppc_vcpu_run_hv() rather than on every guest entry/exit. FSCR affects userspace execution in that it can prohibit access to certain facilities by userspace. We restore it to the normal value for the task on exit from the KVM_RUN ioctl. IAMR is normally 0, and is restored to 0 on guest exit. However, with a radix host on POWER9, it is set to a value that prevents the kernel from executing user-accessible memory. On POWER9, we save IAMR on guest entry and restore it on guest exit to the saved value rather than 0. On POWER8 we continue to set it to 0 on guest exit. PSPB is normally 0. We restore it to 0 on guest exit to prevent userspace taking advantage of the guest having set it non-zero (which would allow userspace to set its SMT priority to high). UAMOR is normally 0. We restore it to 0 on guest exit to prevent the AMR from being used as a covert channel between userspace processes, since the AMR is not context-switched at present. Fixes: b005255e12a3 ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08) Cc: stable@vger.kernel.org # v3.14+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
ca8efa1d |
|
06-Jun-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Context-switch EBB registers properly This adds code to save the values of three SPRs (special-purpose registers) used by userspace to control event-based branches (EBBs), which are essentially interrupts that get delivered directly to userspace. These registers are loaded up with guest values when entering the guest, and their values are saved when exiting the guest, but we were not saving the host values and restoring them before going back to userspace. On POWER8 this would only affect userspace programs which explicitly request the use of EBBs and also use the KVM_RUN ioctl, since the only source of EBBs on POWER8 is the PMU, and there is an explicit enable bit in the PMU registers (and those PMU registers do get properly context-switched between host and guest). On POWER9 there is provision for externally-generated EBBs, and these are not subject to the control in the PMU registers. Since these registers only affect userspace, we can save them when we first come in from userspace and restore them before returning to userspace, rather than saving/restoring the host values on every guest entry/exit. Similarly, we don't need to worry about their values on offline secondary threads since they execute in the context of the idle task, which never executes in userspace. Fixes: b005255e12a3 ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08) Cc: stable@vger.kernel.org # v3.14+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
419af25f |
|
24-May-2017 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
KVM/PPC/Book3S HV: Use cpuhp_setup_state_nocalls_cpuslocked() kvmppc_alloc_host_rm_ops() holds get_online_cpus() while invoking cpuhp_setup_state_nocalls(). cpuhp_setup_state_nocalls() invokes get_online_cpus() as well. This is correct, but prevents the conversion of the hotplug locking to a percpu rwsem. Use cpuhp_setup_state_nocalls_cpuslocked() to avoid the nested call. Convert *_online_cpus() to the new interfaces while at it. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: kvm@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: kvm-ppc@vger.kernel.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Cc: Alexander Graf <agraf@suse.com> Link: http://lkml.kernel.org/r/20170524081547.809616236@linutronix.de
|
#
db4b0dfa |
|
29-Mar-2017 |
Denis Kirjanov <kda@linux-powerpc.org> |
KVM: PPC: Book3S HV: Avoid preemptibility warning in module initialization With CONFIG_DEBUG_PREEMPT, get_paca() produces the following warning in kvmppc_book3s_init_hv() since it calls debug_smp_processor_id(). There is no real issue with the xics_phys field. If paca->kvm_hstate.xics_phys is non-zero on one cpu, it will be non-zero on them all. Therefore this is not fixing any actual problem, just the warning. [ 138.521188] BUG: using smp_processor_id() in preemptible [00000000] code: modprobe/5596 [ 138.521308] caller is .kvmppc_book3s_init_hv+0x184/0x350 [kvm_hv] [ 138.521404] CPU: 5 PID: 5596 Comm: modprobe Not tainted 4.11.0-rc3-00022-gc7e790c #1 [ 138.521509] Call Trace: [ 138.521563] [c0000007d018b810] [c0000000023eef10] .dump_stack+0xe4/0x150 (unreliable) [ 138.521694] [c0000007d018b8a0] [c000000001f6ec04] .check_preemption_disabled+0x134/0x150 [ 138.521829] [c0000007d018b940] [d00000000a010274] .kvmppc_book3s_init_hv+0x184/0x350 [kvm_hv] [ 138.521963] [c0000007d018ba00] [c00000000191d5cc] .do_one_initcall+0x5c/0x1c0 [ 138.522082] [c0000007d018bad0] [c0000000023e9494] .do_init_module+0x84/0x240 [ 138.522201] [c0000007d018bb70] [c000000001aade18] .load_module+0x1f68/0x2a10 [ 138.522319] [c0000007d018bd20] [c000000001aaeb30] .SyS_finit_module+0xc0/0xf0 [ 138.522439] [c0000007d018be30] [c00000000191baec] system_call+0x38/0xfc Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
5af50993 |
|
05-Apr-2017 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller This patch makes KVM capable of using the XIVE interrupt controller to provide the standard PAPR "XICS" style hypercalls. It is necessary for proper operations when the host uses XIVE natively. This has been lightly tested on an actual system, including PCI pass-through with a TG3 device. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [mpe: Cleanup pr_xxx(), unsplit pr_xxx() strings, etc., fix build failures by adding KVM_XIVE which depends on KVM_XICS and XIVE, and adding empty stubs for the kvm_xive_xxx() routines, fixup subject, integrate fixes from Paul for building PR=y HV=n] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
a1c52e1c |
|
20-Jan-2017 |
Markus Elfring <elfring@users.sourceforge.net> |
KVM: PPC: Book3S HV: Use common error handling code in kvmppc_clr_passthru_irq() Add a jump target so that a bit of exception handling can be better reused at the end of this function. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
d3989143 |
|
05-Apr-2017 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
powerpc/kvm: Massage order of #include We traditionally have linux/ before asm/ Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
03441a34 |
|
08-Feb-2017 |
Ingo Molnar <mingo@kernel.org> |
sched/headers: Prepare for new header dependencies before moving code to <linux/sched/stat.h> We are going to split <linux/sched/stat.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/stat.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
174cd4b1 |
|
02-Feb-2017 |
Ingo Molnar <mingo@kernel.org> |
sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> Fix up affected files that include this signal functionality via sched.h. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
5e985969 |
|
19-Dec-2016 |
David Gibson <david@gibson.dropbear.id.au> |
KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation This adds a not yet working outline of the HPT resizing PAPR extension. Specifically it adds the necessary ioctl() functions, their basic steps, the work function which will handle preparation for the resize, and synchronization between these, the guest page fault path and guest HPT update path. The actual guts of the implementation isn't here yet, so for now the calls will always fail. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
f98a8bf9 |
|
19-Dec-2016 |
David Gibson <david@gibson.dropbear.id.au> |
KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size The KVM_PPC_ALLOCATE_HTAB ioctl() is used to set the size of hashed page table (HPT) that userspace expects a guest VM to have, and is also used to clear that HPT when necessary (e.g. guest reboot). At present, once the ioctl() is called for the first time, the HPT size can never be changed thereafter - it will be cleared but always sized as from the first call. With upcoming HPT resize implementation, we're going to need to allow userspace to resize the HPT at reset (to change it back to the default size if the guest changed it). So, we need to allow this ioctl() to change the HPT size. This patch also updates Documentation/virtual/kvm/api.txt to reflect the new behaviour. In fact the documentation was already slightly incorrect since 572abd5 "KVM: PPC: Book3S HV: Don't fall back to smaller HPT size in allocation ioctl" Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
aae0777f |
|
19-Dec-2016 |
David Gibson <david@gibson.dropbear.id.au> |
KVM: PPC: Book3S HV: Split HPT allocation from activation Currently, kvmppc_alloc_hpt() both allocates a new hashed page table (HPT) and sets it up as the active page table for a VM. For the upcoming HPT resize implementation we're going to want to allocate HPTs separately from activating them. So, split the allocation itself out into kvmppc_allocate_hpt() and perform the activation with a new kvmppc_set_hpt() function. Likewise we split kvmppc_free_hpt(), which just frees the HPT, from kvmppc_release_hpt() which unsets it as an active HPT, then frees it. We also move the logic to fall back to smaller HPT sizes if the first try fails into the single caller which used that behaviour, kvmppc_hv_setup_htab_rma(). This introduces a slight semantic change, in that previously if the initial attempt at CMA allocation failed, we would fall back to attempting smaller sizes with the page allocator. Now, we try first CMA, then the page allocator at each size. As far as I can tell this change should be harmless. To match, we make kvmppc_free_hpt() just free the actual HPT itself. The call to kvmppc_free_lpid() that was there, we move to the single caller. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
3f9d4f5a |
|
19-Dec-2016 |
David Gibson <david@gibson.dropbear.id.au> |
KVM: PPC: Book3S HV: Gather HPT related variables into sub-structure Currently, the powerpc kvm_arch structure contains a number of variables tracking the state of the guest's hashed page table (HPT) in KVM HV. This patch gathers them all together into a single kvm_hpt_info substructure. This makes life more convenient for the upcoming HPT resizing implementation. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
8cf4ecc0 |
|
30-Jan-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Enable radix guest support This adds a few last pieces of the support for radix guests: * Implement the backends for the KVM_PPC_CONFIGURE_V3_MMU and KVM_PPC_GET_RMMU_INFO ioctls for radix guests * On POWER9, allow secondary threads to be on/off-lined while guests are running. * Set up LPCR and the partition table entry for radix guests. * Don't allocate the rmap array in the kvm_memory_slot structure on radix. * Don't try to initialize the HPT for radix guests, since they don't have an HPT. * Take out the code that prevents the HV KVM module from initializing on radix hosts. At this stage, we only support radix guests if the host is running in radix mode, and only support HPT guests if the host is running in HPT mode. Thus a guest cannot switch from one mode to the other, which enables some simplifications. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
a29ebeaf |
|
30-Jan-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Invalidate TLB on radix guest vcpu movement With radix, the guest can do TLB invalidations itself using the tlbie (global) and tlbiel (local) TLB invalidation instructions. Linux guests use local TLB invalidations for translations that have only ever been accessed on one vcpu. However, that doesn't mean that the translations have only been accessed on one physical cpu (pcpu) since vcpus can move around from one pcpu to another. Thus a tlbiel might leave behind stale TLB entries on a pcpu where the vcpu previously ran, and if that task then moves back to that previous pcpu, it could see those stale TLB entries and thus access memory incorrectly. The usual symptom of this is random segfaults in userspace programs in the guest. To cope with this, we detect when a vcpu is about to start executing on a thread in a core that is a different core from the last time it executed. If that is the case, then we mark the core as needing a TLB flush and then send an interrupt to any thread in the core that is currently running a vcpu from the same guest. This will get those vcpus out of the guest, and the first one to re-enter the guest will do the TLB flush. The reason for interrupting the vcpus executing on the old core is to cope with the following scenario: CPU 0 CPU 1 CPU 4 (core 0) (core 0) (core 1) VCPU 0 runs task X VCPU 1 runs core 0 TLB gets entries from task X VCPU 0 moves to CPU 4 VCPU 0 runs task X Unmap pages of task X tlbiel (still VCPU 1) task X moves to VCPU 1 task X runs task X sees stale TLB entries That is, as soon as the VCPU starts executing on the new core, it could unmap and tlbiel some page table entries, and then the task could migrate to one of the VCPUs running on the old core and potentially see stale TLB entries. Since the TLB is shared between all the threads in a core, we only use the bit of kvm->arch.need_tlb_flush corresponding to the first thread in the core. To ensure that we don't have a window where we can miss a flush, this moves the clearing of the bit from before the actual flush to after it. This way, two threads might both do the flush, but we prevent the situation where one thread can enter the guest before the flush is finished. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
8f7b79b8 |
|
30-Jan-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Implement dirty page logging for radix guests This adds code to keep track of dirty pages when requested (that is, when memslot->dirty_bitmap is non-NULL) for radix guests. We use the dirty bits in the PTEs in the second-level (partition-scoped) page tables, together with a bitmap of pages that were dirty when their PTE was invalidated (e.g., when the page was paged out). This bitmap is stored in the first half of the memslot->dirty_bitmap area, and kvm_vm_ioctl_get_dirty_log_hv() now uses the second half for the bitmap that gets returned to userspace. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
5a319350 |
|
30-Jan-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Page table construction and page faults for radix guests This adds the code to construct the second-level ("partition-scoped" in architecturese) page tables for guests using the radix MMU. Apart from the PGD level, which is allocated when the guest is created, the rest of the tree is all constructed in response to hypervisor page faults. As well as hypervisor page faults for missing pages, we also get faults for reference/change (RC) bits needing to be set, as well as various other error conditions. For now, we only set the R or C bit in the guest page table if the same bit is set in the host PTE for the backing page. This code can take advantage of the guest being backed with either transparent or ordinary 2MB huge pages, and insert 2MB page entries into the guest page tables. There is no support for 1GB huge pages yet. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
468808bd |
|
30-Jan-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Set process table for HPT guests on POWER9 This adds the implementation of the KVM_PPC_CONFIGURE_V3_MMU ioctl for HPT guests on POWER9. With this, we can return 1 for the KVM_CAP_PPC_MMU_HASH_V3 capability. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
c9270132 |
|
30-Jan-2017 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Add userspace interfaces for POWER9 MMU This adds two capabilities and two ioctls to allow userspace to find out about and configure the POWER9 MMU in a guest. The two capabilities tell userspace whether KVM can support a guest using the radix MMU, or using the hashed page table (HPT) MMU with a process table and segment tables. (Note that the MMUs in the POWER9 processor cores do not use the process and segment tables when in HPT mode, but the nest MMU does). The KVM_PPC_CONFIGURE_V3_MMU ioctl allows userspace to specify whether a guest will use the radix MMU or the HPT MMU, and to specify the size and location (in guest space) of the process table. The KVM_PPC_GET_RMMU_INFO ioctl gives userspace information about the radix MMU. It returns a list of supported radix tree geometries (base page size and number of bits indexed at each level of the radix tree) and the encoding used to specify the various page sizes for the TLB invalidate entry instruction. Initially, both capabilities return 0 and the ioctls return -EINVAL, until the necessary infrastructure for them to operate correctly is added. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
8464c884 |
|
06-Dec-2016 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Fix H_PROD to actually wake the target vcpu The H_PROD hypercall is supposed to wake up an idle vcpu. We have an implementation, but because Linux doesn't use it except when doing cpu hotplug, it was never tested properly. AIX does use it, and reported it broken. It turns out we were waking the wrong vcpu (the one doing H_PROD, not the target of the prod) and we weren't handling the case where the target needs an IPI to wake it. Fix it by using the existing kvmppc_fast_vcpu_kick_hv() function, which is intended for this kind of thing, and by using the target vcpu not the current vcpu. We were also not looking at the prodded flag when checking whether a ceded vcpu should wake up, so this adds checks for the prodded flag alongside the checks for pending exceptions. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
3deda5e5 |
|
19-Dec-2016 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Don't try to signal cpu -1 If the target vcpu for kvmppc_fast_vcpu_kick_hv() is not running on any CPU, then we will have vcpu->arch.thread_cpu == -1, and as it happens, kvmppc_fast_vcpu_kick_hv will call kvmppc_ipi_thread with -1 as the cpu argument. Although this is not meaningful, in the past, before commit 1704a81ccebc ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9", 2016-11-18), it was harmless because CPU -1 is not in the same core as any real CPU thread. On a POWER9, however, we don't do the "same core" check, so we were trying to do a msgsnd to thread -1, which is invalid. To avoid this, we add a check to see that vcpu->arch.thread_cpu is >= 0 before calling kvmppc_ipi_thread() with it. Since vcpu->arch.thread_vcpu can change asynchronously, we use READ_ONCE to ensure that the value we check is the same value that we use as the argument to kvmppc_ipi_thread(). Fixes: 1704a81ccebc ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
8b0e1953 |
|
24-Dec-2016 |
Thomas Gleixner <tglx@linutronix.de> |
ktime: Cleanup ktime_set() usage ktime_set(S,N) was required for the timespec storage type and is still useful for situations where a Seconds and Nanoseconds part of a time value needs to be converted. For anything where the Seconds argument is 0, this is pointless and can be replaced with a simple assignment. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
|
#
7c0f6ba6 |
|
24-Dec-2016 |
Linus Torvalds <torvalds@linux-foundation.org> |
Replace <asm/uaccess.h> with <linux/uaccess.h> globally This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3f7cd919 |
|
26-Nov-2016 |
Anna-Maria Gleixner <anna-maria@linutronix.de> |
KVM/PPC/Book3S HV: Convert to hotplug state machine Install the callbacks via the state machine. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: kvm@vger.kernel.org Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: linuxppc-dev@lists.ozlabs.org Cc: kvm-ppc@vger.kernel.org Cc: Paul Mackerras <paulus@samba.org> Cc: rt@linutronix.de Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alexander Graf <agraf@suse.com> Link: http://lkml.kernel.org/r/20161126231350.10321-18-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
#
908a0935 |
|
13-Oct-2016 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Comment style and print format fixups Fix comment block to match kernel comment style. Fix print format from signed to unsigned. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
e03f3921 |
|
13-Oct-2016 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Add check for module parameter halt_poll_ns The kvm module parameter halt_poll_ns defines the global maximum halt polling interval and can be dynamically changed by writing to the /sys/module/kvm/parameters/halt_poll_ns sysfs file. However in kvm-hv this module parameter value is only ever checked when we grow the current polling interval for the given vcore. This means that if we decrease the halt_poll_ns value below the current polling interval we won't see any effect unless we try to grow the polling interval above the new max at some point or it happens to be shrunk below the halt_poll_ns value. Update the halt polling code so that we always check for a new module param value of halt_poll_ns and set the current halt polling interval to it if it's currently greater than the new max. This means that it's redundant to also perform this check in the grow_halt_poll_ns() function now. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
307d93e4 |
|
13-Oct-2016 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Use generic kvm module parameters The previous patch exported the variables which back the module parameters of the generic kvm module. Now use these variables in the kvm-hv module so that any change to the generic module parameters will also have the same effect for the kvm-hv module. This removes the duplication of the kvm module parameters which was redundant and should reduce confusion when tuning them. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
2ee13be3 |
|
13-Nov-2016 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Update kvmppc_set_arch_compat() for ISA v3.00 The function kvmppc_set_arch_compat() is used to determine the value of the processor compatibility register (PCR) for a guest running in a given compatibility mode. There is currently no support for v3.00 of the ISA. Add support for v3.00 of the ISA which adds an ISA v2.07 compatilibity mode to the PCR. We also add a check to ensure the processor we are running on is capable of emulating the chosen processor (for example a POWER7 cannot emulate a POWER8, similarly with a POWER8 and a POWER9). Based on work by: Paul Mackerras <paulus@ozlabs.org> [paulus@ozlabs.org - moved dummy PCR_ARCH_300 definition here; set guest_pcr_bit when arch_compat == 0, added comment.] Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
45c940ba |
|
17-Nov-2016 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Treat POWER9 CPU threads as independent subcores With POWER9, each CPU thread has its own MMU context and can be in the host or a guest independently of the other threads; there is still however a restriction that all threads must use the same type of address translation, either radix tree or hashed page table (HPT). Since we only support HPT guests on a HPT host at this point, we can treat the threads as being independent, and avoid all of the work of coordinating the CPU threads. To make this simpler, we introduce a new threads_per_vcore() function that returns 1 on POWER9 and threads_per_subcore on POWER7/8, and use that instead of threads_per_subcore or threads_per_core in various places. This also changes the value of the KVM_CAP_PPC_SMT capability on POWER9 systems from 4 to 1, so that userspace will not try to create VMs with multiple vcpus per vcore. (If userspace did create a VM that thought it was in an SMT mode, the VM might try to use the msgsndp instruction, which will not work as expected. In future it may be possible to trap and emulate msgsndp in order to allow VMs to think they are in an SMT mode, if only for the purpose of allowing migration from POWER8 systems.) With all this, we can now run guests on POWER9 as long as the host is running with HPT translation. Since userspace currently has no way to request radix tree translation for the guest, the guest has no choice but to use HPT translation. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
84f7139c |
|
21-Nov-2016 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Enable hypervisor virtualization interrupts while in guest The new XIVE interrupt controller on POWER9 can direct external interrupts to the hypervisor or the guest. The interrupts directed to the hypervisor are controlled by an LPCR bit called LPCR_HVICE, and come in as a "hypervisor virtualization interrupt". This sets the LPCR bit so that hypervisor virtualization interrupts can occur while we are in the guest. We then also need to cope with exiting the guest because of a hypervisor virtualization interrupt. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
f725758b |
|
17-Nov-2016 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9 POWER9 includes a new interrupt controller, called XIVE, which is quite different from the XICS interrupt controller on POWER7 and POWER8 machines. KVM-HV accesses the XICS directly in several places in order to send and clear IPIs and handle interrupts from PCI devices being passed through to the guest. In order to make the transition to XIVE easier, OPAL firmware will include an emulation of XICS on top of XIVE. Access to the emulated XICS is via OPAL calls. The one complication is that the EOI (end-of-interrupt) function can now return a value indicating that another interrupt is pending; in this case, the XIVE will not signal an interrupt in hardware to the CPU, and software is supposed to acknowledge the new interrupt without waiting for another interrupt to be delivered in hardware. This adapts KVM-HV to use the OPAL calls on machines where there is no XICS hardware. When there is no XICS, we look for a device-tree node with "ibm,opal-intc" in its compatible property, which is how OPAL indicates that it provides XICS emulation. In order to handle the EOI return value, kvmppc_read_intr() has become kvmppc_read_one_intr(), with a boolean variable passed by reference which can be set by the EOI functions to indicate that another interrupt is pending. The new kvmppc_read_intr() keeps calling kvmppc_read_one_intr() until there are no more interrupts to process. The return value from kvmppc_read_intr() is the largest non-zero value of the returns from kvmppc_read_one_intr(). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
1704a81c |
|
17-Nov-2016 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9 On POWER9, the msgsnd instruction is able to send interrupts to other cores, as well as other threads on the local core. Since msgsnd is generally simpler and faster than sending an IPI via the XICS, we use msgsnd for all IPIs sent by KVM on POWER9. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
7c5b06ca |
|
17-Nov-2016 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Adapt TLB invalidations to work on POWER9 POWER9 adds new capabilities to the tlbie (TLB invalidate entry) and tlbiel (local tlbie) instructions. Both instructions get a set of new parameters (RIC, PRS and R) which appear as bits in the instruction word. The tlbiel instruction now has a second register operand, which contains a PID and/or LPID value if needed, and should otherwise contain 0. This adapts KVM-HV's usage of tlbie and tlbiel to work on POWER9 as well as older processors. Since we only handle HPT guests so far, we need RIC=0 PRS=0 R=0, which ends up with the same instruction word as on previous processors, so we don't need to conditionally execute different instructions depending on the processor. The local flush on first entry to a guest in book3s_hv_rmhandlers.S is a loop which depends on the number of TLB sets. Rather than using feature sections to set the number of iterations based on which CPU we're on, we now work out this number at VM creation time and store it in the kvm_arch struct. That will make it possible to get the number from the device tree in future, which will help with compatibility with future processors. Since mmu_partition_table_set_entry() does a global flush of the whole LPID, we don't need to do the TLB flush on first entry to the guest on each processor. Therefore we don't set all bits in the tlb_need_flush bitmap on VM startup on POWER9. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
e9cf1e08 |
|
17-Nov-2016 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Add new POWER9 guest-accessible SPRs This adds code to handle two new guest-accessible special-purpose registers on POWER9: TIDR (thread ID register) and PSSCR (processor stop status and control register). They are context-switched between host and guest, and the guest values can be read and set via the one_reg interface. The PSSCR contains some fields which are guest-accessible and some which are only accessible in hypervisor mode. We only allow the guest-accessible fields to be read or set by userspace. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
7a84084c |
|
16-Nov-2016 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Set partition table rather than SDR1 on POWER9 On POWER9, the SDR1 register (hashed page table base address) is no longer used, and instead the hardware reads the HPT base address and size from the partition table. The partition table entry also contains the bits that specify the page size for the VRMA mapping, which were previously in the LPCR. The VPM0 bit of the LPCR is now reserved; the processor now always uses the VRMA (virtual real-mode area) mechanism for guest real-mode accesses in HPT mode, and the RMO (real-mode offset) mechanism has been dropped. When entering or exiting the guest, we now only have to set the LPIDR (logical partition ID register), not the SDR1 register. There is also no requirement now to transition via a reserved LPID value. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
0d808df0 |
|
06-Nov-2016 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Save/restore XER in checkpointed register state When switching from/to a guest that has a transaction in progress, we need to save/restore the checkpointed register state. Although XER is part of the CPU state that gets checkpointed, the code that does this saving and restoring doesn't save/restore XER. This fixes it by saving and restoring the XER. To allow userspace to read/write the checkpointed XER value, we also add a new ONE_REG specifier. The visible effect of this bug is that the guest may see its XER value being corrupted when it uses transactions. Fixes: e4e38121507a ("KVM: PPC: Book3S HV: Add transactional memory support") Fixes: 0a8eccefcb34 ("KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest exit") Cc: stable@vger.kernel.org # v3.15+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
a56ee9f8 |
|
03-Nov-2016 |
Yongji Xie <xyjxie@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Add a per vcpu cache for recently page faulted MMIO entries This keeps a per vcpu cache for recently page faulted MMIO entries. On a page fault, if the entry exists in the cache, we can avoid some time-consuming paths, for example, looking up HPT, locking HPTE twice and searching mmio gfn from memslots, then directly call kvmppc_hv_emulate_mmio(). In current implenment, we limit the size of cache to four. We think it's enough to cover the high-frequency MMIO HPTEs in most case. For example, considering the case of using virtio device, for virtio legacy devices, one HPTE could handle notifications from up to 1024 (64K page / 64 byte Port IO register) devices, so one cache entry is enough; for virtio modern devices, we always need one HPTE to handle notification for each device because modern device would use a 8M MMIO register to notify host instead of Port IO register, typically the system's configuration should not exceed four virtio devices per vcpu, four cache entry is also enough in this case. Of course, if needed, we could also modify the macro to a module parameter in the future. Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
28d057c8 |
|
17-Oct-2016 |
Wei Yongjun <weiyongjun1@huawei.com> |
KVM: PPC: Book3S HV: Use list_move_tail instead of list_del/list_add_tail Using list_move_tail() instead of list_del() + list_add_tail(). Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
b009031f |
|
15-Sep-2016 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Take out virtual core piggybacking code This takes out the code that arranges to run two (or more) virtual cores on a single subcore when possible, that is, when both vcores are from the same VM, the VM is configured with one CPU thread per virtual core, and all the per-subcore registers have the same value in each vcore. Since the VTB (virtual timebase) is a per-subcore register, and will almost always differ between vcores, this code is disabled on POWER8 machines, meaning that it is only usable on POWER7 machines (which don't have VTB). Given the tiny number of POWER7 machines which have firmware that allows them to run HV KVM, the benefit of simplifying the code outweighs the loss of this feature on POWER7 machines. Tested-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
88b02cf9 |
|
14-Sep-2016 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread POWER8 has one virtual timebase (VTB) register per subcore, not one per CPU thread. The HV KVM code currently treats VTB as a per-thread register, which can lead to spurious soft lockup messages from guests which use the VTB as the time source for the soft lockup detector. (CPUs before POWER8 did not have the VTB register.) For HV KVM, this fixes the problem by making only the primary thread in each virtual core save and restore the VTB value. With this, the VTB state becomes part of the kvmppc_vcore structure. This also means that "piggybacking" of multiple virtual cores onto one subcore is not possible on POWER8, because then the virtual cores would share a single VTB register. PR KVM emulates a VTB register, which is per-vcpu because PR KVM has no notion of CPU threads or SMT. For PR KVM we move the VTB state into the kvmppc_vcpu_book3s struct. Cc: stable@vger.kernel.org # v3.14+ Reported-by: Thomas Huth <thuth@redhat.com> Tested-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
5d375199 |
|
18-Aug-2016 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Set server for passed-through interrupts When a guest has a PCI pass-through device with an interrupt, it will direct the interrupt to a particular guest VCPU. In fact the physical interrupt might arrive on any CPU, and then get delivered to the target VCPU in the emulated XICS (guest interrupt controller), and eventually delivered to the target VCPU. Now that we have code to handle device interrupts in real mode without exiting to the host kernel, there is an advantage to having the device interrupt arrive on the same sub(core) as the target VCPU is running on. In this situation, the interrupt can be delivered to the target VCPU without any exit to the host kernel (using a hypervisor doorbell interrupt between threads if necessary). This patch aims to get passed-through device interrupts arriving on the correct core by setting the interrupt server in the real hardware XICS for the interrupt to the first thread in the (sub)core where its target VCPU is running. We do this in the real-mode H_EOI code because the H_EOI handler already needs to look at the emulated ICS state for the interrupt (whereas the H_XIRR handler doesn't), and we know we are running in the target VCPU context at that point. We set the server CPU in hardware using an OPAL call, regardless of what the IRQ affinity mask for the interrupt says, and without updating the affinity mask. This amounts to saying that when an interrupt is passed through to a guest, as a matter of policy we allow the guest's affinity for the interrupt to override the host's. This is inspired by an earlier patch from Suresh Warrier, although none of this code came from that earlier patch. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
644abbb2 |
|
18-Aug-2016 |
Suresh Warrier <warrier@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Tunable to disable KVM IRQ bypass Add a module parameter kvm_irq_bypass for kvm_hv.ko to disable IRQ bypass for passthrough interrupts. The default value of this tunable is 1 - that is enable the feature. Since the tunable is used by built-in kernel code, we use the module_param_cb macro to achieve this. Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
f7af5209 |
|
18-Aug-2016 |
Suresh Warrier <warrier@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Complete passthrough interrupt in host In existing real mode ICP code, when updating the virtual ICP state, if there is a required action that cannot be completely handled in real mode, as for instance, a VCPU needs to be woken up, flags are set in the ICP to indicate the required action. This is checked when returning from hypercalls to decide whether the call needs switch back to the host where the action can be performed in virtual mode. Note that if h_ipi_redirect is enabled, real mode code will first try to message a free host CPU to complete this job instead of returning the host to do it ourselves. Currently, the real mode PCI passthrough interrupt handling code checks if any of these flags are set and simply returns to the host. This is not good enough as the trap value (0x500) is treated as an external interrupt by the host code. It is only when the trap value is a hypercall that the host code searches for and acts on unfinished work by calling kvmppc_xics_rm_complete. This patch introduces a special trap BOOK3S_INTERRUPT_HV_RM_HARD which is returned by KVM if there is unfinished business to be completed in host virtual mode after handling a PCI passthrough interrupt. The host checks for this special interrupt condition and calls into the kvmppc_xics_rm_complete, which is made an exported function for this reason. [paulus@ozlabs.org - moved logic to set r12 to BOOK3S_INTERRUPT_HV_RM_HARD in book3s_hv_rmhandlers.S into the end of kvmppc_check_wake_reason.] Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
e3c13e56 |
|
18-Aug-2016 |
Suresh Warrier <warrier@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Handle passthrough interrupts in guest Currently, KVM switches back to the host to handle any external interrupt (when the interrupt is received while running in the guest). This patch updates real-mode KVM to check if an interrupt is generated by a passthrough adapter that is owned by this guest. If so, the real mode KVM will directly inject the corresponding virtual interrupt to the guest VCPU's ICS and also EOI the interrupt in hardware. In short, the interrupt is handled entirely in real mode in the guest context without switching back to the host. In some rare cases, the interrupt cannot be completely handled in real mode, for instance, a VCPU that is sleeping needs to be woken up. In this case, KVM simply switches back to the host with trap reason set to 0x500. This works, but it is clearly not very efficient. A following patch will distinguish this case and handle it correctly in the host. Note that we can use the existing check_too_hard() routine even though we are not in a hypercall to determine if there is unfinished business that needs to be completed in host virtual mode. The patch assumes that the mapping between hardware interrupt IRQ and virtual IRQ to be injected to the guest already exists for the PCI passthrough interrupts that need to be handled in real mode. If the mapping does not exist, KVM falls back to the default existing behavior. The KVM real mode code reads mappings from the mapped array in the passthrough IRQ map without taking any lock. We carefully order the loads and stores of the fields in the kvmppc_irq_map data structure using memory barriers to avoid an inconsistent mapping being seen by the reader. Thus, although it is possible to miss a map entry, it is not possible to read a stale value. [paulus@ozlabs.org - get irq_chip from irq_map rather than pimap, pulled out powernv eoi change into a separate patch, made kvmppc_read_intr get the vcpu from the paca rather than being passed in, rewrote the logic at the end of kvmppc_read_intr to avoid deep indentation, simplified logic in book3s_hv_rmhandlers.S since we were always restoring SRR0/1 anyway, get rid of the cached array (just use the mapped array), removed the kick_all_cpus_sync() call, clear saved_xirr PACA field when we handle the interrupt in real mode, fix compilation with CONFIG_KVM_XICS=n.] Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
c57875f5 |
|
18-Aug-2016 |
Suresh Warrier <warrier@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Enable IRQ bypass Add the irq_bypass_add_producer and irq_bypass_del_producer functions. These functions get called whenever a GSI is being defined for a guest. They create/remove the mapping between host real IRQ numbers and the guest GSI. Add the following helper functions to manage the passthrough IRQ map. kvmppc_set_passthru_irq() Creates a mapping in the passthrough IRQ map that maps a host IRQ to a guest GSI. It allocates the structure (one per guest VM) the first time it is called. kvmppc_clr_passthru_irq() Removes the passthrough IRQ map entry given a guest GSI. The passthrough IRQ map structure is not freed even when the number of mapped entries goes to zero. It is only freed when the VM is destroyed. [paulus@ozlabs.org - modified to use is_pnv_opal_msi() rather than requiring all passed-through interrupts to use the same irq_chip; changed deletion so it zeroes out the r_hwirq field rather than copying the last entry down and decrementing the number of entries.] Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
8daaafc8 |
|
18-Aug-2016 |
Suresh Warrier <warrier@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Introduce kvmppc_passthru_irqmap This patch introduces an IRQ mapping structure, the kvmppc_passthru_irqmap structure that is to be used to map the real hardware IRQ in the host with the virtual hardware IRQ (gsi) that is injected into a guest by KVM for passthrough adapters. Currently, we assume a separate IRQ mapping structure for each guest. Each kvmppc_passthru_irqmap has a mapping arrays, containing all defined real<->virtual IRQs. [paulus@ozlabs.org - removed irq_chip field from struct kvmppc_passthru_irqmap; changed parameter for kvmppc_get_passthru_irqmap from struct kvm_vcpu * to struct kvm *, removed small cached array.] Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
2a27f514 |
|
01-Aug-2016 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Implement existing and add new halt polling vcpu stats vcpu stats are used to collect information about a vcpu which can be viewed in the debugfs. For example halt_attempted_poll and halt_successful_poll are used to keep track of the number of times the vcpu attempts to and successfully polls. These stats are currently not used on powerpc. Implement incrementation of the halt_attempted_poll and halt_successful_poll vcpu stats for powerpc. Since these stats are summed over all the vcpus for all running guests it doesn't matter which vcpu they are attributed to, thus we choose the current runner vcpu of the vcore. Also add new vcpu stats: halt_poll_success_ns, halt_poll_fail_ns and halt_wait_ns to be used to accumulate the total time spend polling successfully, polling unsuccessfully and waiting respectively, and halt_successful_wait to accumulate the number of times the vcpu waits. Given that halt_poll_success_ns, halt_poll_fail_ns and halt_wait_ns are expressed in nanoseconds it is necessary to represent these as 64-bit quantities, otherwise they would overflow after only about 4 seconds. Given that the total time spend either polling or waiting will be known and the number of times that each was done, it will be possible to determine the average poll and wait times. This will give the ability to tune the kvm module parameters based on the calculated average wait and poll times. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
0cda69dd |
|
01-Aug-2016 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Implement halt polling This patch introduces new halt polling functionality into the kvm_hv kernel module. When a vcore is idle it will poll for some period of time before scheduling itself out. When all of the runnable vcpus on a vcore have ceded (and thus the vcore is idle) we schedule ourselves out to allow something else to run. In the event that we need to wake up very quickly (for example an interrupt arrives), we are required to wait until we get scheduled again. Implement halt polling so that when a vcore is idle, and before scheduling ourselves, we poll for vcpus in the runnable_threads list which have pending exceptions or which leave the ceded state. If we poll successfully then we can get back into the guest very quickly without ever scheduling ourselves, otherwise we schedule ourselves out as before. There exists generic halt_polling code in virt/kvm_main.c, however on powerpc the polling conditions are different to the generic case. It would be nice if we could just implement an arch specific kvm_check_block() function, but there is still other arch specific things which need to be done for kvm_hv (for example manipulating vcore states) which means that a separate implementation is the best option. Testing of this patch with a TCP round robin test between two guests with virtio network interfaces has found a decrease in round trip time of ~15us on average. A performance gain is only seen when going out of and back into the guest often and quickly, otherwise there is no net benefit from the polling. The polling interval is adjusted such that when we are often scheduled out for long periods of time it is reduced, and when we often poll successfully it is increased. The rate at which the polling interval increases or decreases, and the maximum polling interval, can be set through module parameters. Based on the implementation in the generic kvm module by Wanpeng Li and Paolo Bonzini, and on direction from Paul Mackerras. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
7b5f8272 |
|
01-Aug-2016 |
Suraj Jitindar Singh <sjitindarsingh@gmail.com> |
KVM: PPC: Book3S HV: Change vcore element runnable_threads from linked-list to array The struct kvmppc_vcore is a structure used to store various information about a virtual core for a kvm guest. The runnable_threads element of the struct provides a list of all of the currently runnable vcpus on the core (those in the KVMPPC_VCPU_RUNNABLE state). The previous implementation of this list was a linked_list. The next patch requires that the list be able to be iterated over without holding the vcore lock. Reimplement the runnable_threads list in the kvmppc_vcore struct as an array. Implement function to iterate over valid entries in the array and update access sites accordingly. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
6edaa530 |
|
15-Jun-2016 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: remove kvm_guest_enter/exit wrappers Use the functions from context_tracking.h directly. Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
fd7bacbc |
|
14-May-2016 |
Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt When a guest is assigned to a core it converts the host Timebase (TB) into guest TB by adding guest timebase offset before entering into guest. During guest exit it restores the guest TB to host TB. This means under certain conditions (Guest migration) host TB and guest TB can differ. When we get an HMI for TB related issues the opal HMI handler would try fixing errors and restore the correct host TB value. With no guest running, we don't have any issues. But with guest running on the core we run into TB corruption issues. If we get an HMI while in the guest, the current HMI handler invokes opal hmi handler before forcing guest to exit. The guest exit path subtracts the guest TB offset from the current TB value which may have already been restored with host value by opal hmi handler. This leads to incorrect host and guest TB values. With split-core, things become more complex. With split-core, TB also gets split and each subcore gets its own TB register. When a hmi handler fixes a TB error and restores the TB value, it affects all the TB values of sibling subcores on the same core. On TB errors all the thread in the core gets HMI. With existing code, the individual threads call opal hmi handle independently which can easily throw TB out of sync if we have guest running on subcores. Hence we will need to co-ordinate with all the threads before making opal hmi handler call followed by TB resync. This patch introduces a sibling subcore state structure (shared by all threads in the core) in paca which holds information about whether sibling subcores are in Guest mode or host mode. An array in_guest[] of size MAX_SUBCORE_PER_CORE=4 is used to maintain the state of each subcore. The subcore id is used as index into in_guest[] array. Only primary thread entering/exiting the guest is responsible to set/unset its designated array element. On TB error, we get HMI interrupt on every thread on the core. Upon HMI, this patch will now force guest to vacate the core/subcore. Primary thread from each subcore will then turn off its respective bit from the above bitmap during the guest exit path just after the guest->host partition switch is complete. All other threads that have just exited the guest OR were already in host will wait until all other subcores clears their respective bit. Once all the subcores turn off their respective bit, all threads will will make call to opal hmi handler. It is not necessary that opal hmi handler would resync the TB value for every HMI interrupts. It would do so only for the HMI caused due to TB errors. For rest, it would not touch TB value. Hence to make things simpler, primary thread would call TB resync explicitly once for each core immediately after opal hmi handler instead of subtracting guest offset from TB. TB resync call will restore the TB with host value. Thus we can be sure about the TB state. One of the primary threads exiting the guest will take up the responsibility of calling TB resync. It will use one of the top bits (bit 63) from subcore state flags bitmap to make the decision. The first primary thread (among the subcores) that is able to set the bit will have to call the TB resync. Rest all other threads will wait until TB resync is complete. Once TB resync is complete all threads will then proceed. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
#
07f8ab25 |
|
10-May-2016 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Fix build error in book3s_hv.c When CONFIG_KVM_XICS is enabled, CPU_UP_PREPARE and other macros for CPU states in linux/cpu.h are needed by arch/powerpc/kvm/book3s_hv.c. Otherwise, build error as below is seen: gwshan@gwshan:~/sandbox/l$ make arch/powerpc/kvm/book3s_hv.o : CC arch/powerpc/kvm/book3s_hv.o arch/powerpc/kvm/book3s_hv.c: In function ‘kvmppc_cpu_notify’: arch/powerpc/kvm/book3s_hv.c:3072:7: error: ‘CPU_UP_PREPARE’ \ undeclared (first use in this function) This fixes the issue introduced by commit <6f3bb80944> ("KVM: PPC: Book3S HV: kvmppc_host_rm_ops - handle offlining CPUs"). Fixes: 6f3bb8094414 Cc: stable@vger.kernel.org # v4.6 Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
|
#
50de596d |
|
29-Apr-2016 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
powerpc/mm/hash: Add support for Power9 Hash PowerISA 3.0 adds a parition table indexed by LPID. Parition table allows us to specify the MMU model that will be used for guest and host translation. This patch adds support with SLB based hash model (UPRT = 0). What is required with this model is to support the new hash page table entry format and also setup partition table such that we use hash table for address translation. We don't have segment table support yet. In order to make sure we don't load KVM module on Power9 (since we don't have kvm support yet) this patch also disables KVM on Power9. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
520fe9c6 |
|
21-Dec-2015 |
Suresh E. Warrier <warrier@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Add tunable to control H_IPI redirection Redirecting the wakeup of a VCPU from the H_IPI hypercall to a core running in the host is usually a good idea, most workloads seemed to benefit. However, in one heavily interrupt-driven SMT1 workload, some regression was observed. This patch adds a kvm_hv module parameter called h_ipi_redirect to control this feature. The default value for this tunable is 1 - that is enable the feature. Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
|
#
0c2a6606 |
|
17-Dec-2015 |
Suresh Warrier <warrier@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Host side kick VCPU when poked by real-mode KVM This patch adds the support for the kick VCPU operation for kvmppc_host_rm_ops. The kvmppc_xics_ipi_action() function provides the function to be invoked for a host side operation when poked by the real mode KVM. This is initiated by KVM by sending an IPI to any free host core. KVM real mode must set the rm_action to XICS_RM_KICK_VCPU and rm_data to point to the VCPU to be woken up before sending the IPI. Note that we have allocated one kvmppc_host_rm_core structure per core. The above values need to be set in the structure corresponding to the core to which the IPI will be sent. Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
|
#
6f3bb809 |
|
17-Dec-2015 |
Suresh Warrier <warrier@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: kvmppc_host_rm_ops - handle offlining CPUs The kvmppc_host_rm_ops structure keeps track of which cores are are in the host by maintaining a bitmask of active/runnable online CPUs that have not entered the guest. This patch adds support to manage the bitmask when a CPU is offlined or onlined in the host. Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
|
#
b8e6a87c |
|
17-Dec-2015 |
Suresh Warrier <warrier@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Manage core host state Update the core host state in kvmppc_host_rm_ops whenever the primary thread of the core enters the guest or returns back. Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
|
#
79b6c247 |
|
17-Dec-2015 |
Suresh Warrier <warrier@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Host-side RM data structures This patch defines the data structures to support the setting up of host side operations while running in real mode in the guest, and also the functions to allocate and free it. The operations are for now limited to virtual XICS operations. Currently, we have only defined one operation in the data structure: - Wake up a VCPU sleeping in the host when it receives a virtual interrupt The operations are assigned at the core level because PowerKVM requires that the host run in SMT off mode. For each core, we will need to manage its state atomically - where the state is defined by: 1. Is the core running in the host? 2. Is there a Real Mode (RM) operation pending on the host? Currently, core state is only managed at the whole-core level even when the system is in split-core mode. This just limits the number of free or "available" cores in the host to perform any host-side operations. The kvmppc_host_rm_core.rm_data allows any data to be passed by KVM in real mode to the host core along with the operation to be performed. The kvmppc_host_rm_ops structure is allocated the very first time a guest VM is started. Initial core state is also set - all online cores are in the host. This structure is never deleted, not even when there are no active guests. However, it needs to be freed when the module is unloaded because the kvmppc_host_rm_ops_hv can contain function pointers to kvm-hv.ko functions for the different supported host operations. Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
|
#
8577370f |
|
19-Feb-2016 |
Marcelo Tosatti <mtosatti@redhat.com> |
KVM: Use simple waitqueue for vcpu->wq The problem: On -rt, an emulated LAPIC timer instances has the following path: 1) hard interrupt 2) ksoftirqd is scheduled 3) ksoftirqd wakes up vcpu thread 4) vcpu thread is scheduled This extra context switch introduces unnecessary latency in the LAPIC path for a KVM guest. The solution: Allow waking up vcpu thread from hardirq context, thus avoiding the need for ksoftirqd to be scheduled. Normal waitqueues make use of spinlocks, which on -RT are sleepable locks. Therefore, waking up a waitqueue waiter involves locking a sleeping lock, which is not allowed from hard interrupt context. cyclictest command line: This patch reduces the average latency in my tests from 14us to 11us. Daniel writes: Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency benchmark on mainline. The test was run 1000 times on tip/sched/core 4.4.0-rc8-01134-g0905f04: ./x86-run x86/tscdeadline_latency.flat -cpu host with idle=poll. The test seems not to deliver really stable numbers though most of them are smaller. Paolo write: "Anything above ~10000 cycles means that the host went to C1 or lower---the number means more or less nothing in that case. The mean shows an improvement indeed." Before: min max mean std count 1000.000000 1000.000000 1000.000000 1000.000000 mean 5162.596000 2019270.084000 5824.491541 20681.645558 std 75.431231 622607.723969 89.575700 6492.272062 min 4466.000000 23928.000000 5537.926500 585.864966 25% 5163.000000 1613252.750000 5790.132275 16683.745433 50% 5175.000000 2281919.000000 5834.654000 23151.990026 75% 5190.000000 2382865.750000 5861.412950 24148.206168 max 5228.000000 4175158.000000 6254.827300 46481.048691 After min max mean std count 1000.000000 1000.00000 1000.000000 1000.000000 mean 5143.511000 2076886.10300 5813.312474 21207.357565 std 77.668322 610413.09583 86.541500 6331.915127 min 4427.000000 25103.00000 5529.756600 559.187707 25% 5148.000000 1691272.75000 5784.889825 17473.518244 50% 5160.000000 2308328.50000 5832.025000 23464.837068 75% 5172.000000 2393037.75000 5853.177675 24223.969976 max 5222.000000 3922458.00000 6186.720500 42520.379830 [Patch was originaly based on the swait implementation found in the -rt tree. Daniel ported it to mainline's version and gathered the benchmark numbers for tscdeadline_latency test.] Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: linux-rt-users@vger.kernel.org Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1455871601-27484-4-git-send-email-wagi@monom.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
#
d3695aa4 |
|
14-Feb-2016 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
KVM: PPC: Add support for multiple-TCE hcalls This adds real and virtual mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO devices or emulated PCI. These calls allow adding multiple entries (up to 512) into the TCE table in one call which saves time on transition between kernel and user space. The current implementation of kvmppc_h_stuff_tce() allows it to be executed in both real and virtual modes so there is one helper. The kvmppc_rm_h_put_tce_indirect() needs to translate the guest address to the host address and since the translation is different, there are 2 helpers - one for each mode. This implements the KVM_CAP_PPC_MULTITCE capability. When present, the kernel will try handling H_PUT_TCE_INDIRECT and H_STUFF_TCE if these are enabled by the userspace via KVM_CAP_PPC_ENABLE_HCALL. If they can not be handled by the kernel, they are passed on to the user space. The user space still has to have an implementation for these. Both HV and PR-syle KVM are supported. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
|
#
c20875a3 |
|
11-Nov-2015 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Prohibit setting illegal transaction state in MSR Currently it is possible for userspace (e.g. QEMU) to set a value for the MSR for a guest VCPU which has both of the TS bits set, which is an illegal combination. The result of this is that when we execute a hrfid (hypervisor return from interrupt doubleword) instruction to enter the guest, the CPU will take a TM Bad Thing type of program interrupt (vector 0x700). Now, if PR KVM is configured in the kernel along with HV KVM, we actually handle this without crashing the host or giving hypervisor privilege to the guest; instead what happens is that we deliver a program interrupt to the guest, with SRR0 reflecting the address of the hrfid instruction and SRR1 containing the MSR value at that point. If PR KVM is not configured in the kernel, then we try to run the host's program interrupt handler with the MMU set to the guest context, which almost certainly causes a host crash. This closes the hole by making kvmppc_set_msr_hv() check for the illegal combination and force the TS field to a safe value (00, meaning non-transactional). Cc: stable@vger.kernel.org # v3.9+ Signed-off-by: Paul Mackerras <paulus@samba.org>
|
#
1c9e3d51 |
|
11-Nov-2015 |
Paul Mackerras <paulus@ozlabs.org> |
KVM: PPC: Book3S HV: Handle unexpected traps in guest entry/exit code better As we saw with the TM Bad Thing type of program interrupt occurring on the hrfid that enters the guest, it is not completely impossible to have a trap occurring in the guest entry/exit code, despite the fact that the code has been written to avoid taking any traps. This adds a check in the kvmppc_handle_exit_hv() function to detect the case when a trap has occurred in the hypervisor-mode code, and instead of treating it just like a trap in guest code, we now print a message and return to userspace with a KVM_EXIT_INTERNAL_ERROR exit reason. Of the various interrupts that get handled in the assembly code in the guest exit path and that can return directly to the guest, the only one that can occur when MSR.HV=1 and MSR.EE=0 is machine check (other than system call, which we can avoid just by not doing a sc instruction). Therefore this adds code to the machine check path to ensure that if the MCE occurred in hypervisor mode, we exit to the host rather than trying to continue the guest. Signed-off-by: Paul Mackerras <paulus@samba.org>
|
#
579e633e |
|
28-Oct-2015 |
Anton Blanchard <anton@samba.org> |
powerpc: create flush_all_to_thread() Create a single function that flushes everything (FP, VMX, VSX, SPE). Doing this all at once means we only do one MSR write. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
e09fefde |
|
05-Nov-2015 |
David Hildenbrand <dahi@linux.vnet.ibm.com> |
KVM: Use common function for VCPU lookup by id Let's reuse the new common function for VPCU lookup by id. Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> [split out the new function into a separate patch]
|
#
f74f2e2e |
|
02-Nov-2015 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Don't dynamically split core when already split In static micro-threading modes, the dynamic micro-threading code is supposed to be disabled, because subcores can't make independent decisions about what micro-threading mode to put the core in - there is only one micro-threading mode for the whole core. The code that implements dynamic micro-threading checks for this, except that the check was missed in one case. This means that it is possible for a subcore in static 2-way micro-threading mode to try to put the core into 4-way micro-threading mode, which usually leads to stuck CPUs, spinlock lockups, and other stalls in the host. The problem was in the can_split_piggybacked_subcores() function, which should always return false if the system is in a static micro-threading mode. This fixes the problem by making can_split_piggybacked_subcores() use subcore_config_ok() for its checks, as subcore_config_ok() includes the necessary check for the static micro-threading modes. Credit to Gautham Shenoy for working out that the reason for the hangs and stalls we were seeing was that we were trying to do dynamic 4-way micro-threading while we were in static 2-way mode. Fixes: b4deba5c41e9 Cc: vger@stable.kernel.org # v4.3 Signed-off-by: Paul Mackerras <paulus@samba.org>
|
#
23316316 |
|
20-Oct-2015 |
Paul Mackerras <paulus@samba.org> |
powerpc: Revert "Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8" This reverts commit 9678cdaae939 ("Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8") because the original commit had multiple, partly self-cancelling bugs, that could cause occasional memory corruption. In fact the logmpp instruction was incorrectly using register r0 as the source of the buffer address and operation code, and depending on what was in r0, it would either do nothing or corrupt the 64k page pointed to by r0. The logmpp instruction encoding and the operation code definitions could be corrected, but then there is the problem that there is no clearly defined way to know when the hardware has finished writing to the buffer. The original commit attempted to work around this by aborting the write-out before starting the prefetch, but this is ineffective in the case where the virtual core is now executing on a different physical core from the one where the write-out was initiated. These problems plus advice from the hardware designers not to use the function (since the measured performance improvement from using the feature was actually mostly negative), mean that reverting the code is the best option. Fixes: 9678cdaae939 ("Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8") Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
5fc3e64f |
|
17-Sep-2015 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Fix handling of interrupted VCPUs This fixes a bug which results in stale vcore pointers being left in the per-cpu preempted vcore lists when a VM is destroyed. The result of the stale vcore pointers is usually either a crash or a lockup inside collect_piggybacks() when another VM is run. A typical lockup message looks like: [ 472.161074] NMI watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [qemu-system-ppc:7039] [ 472.161204] Modules linked in: kvm_hv kvm_pr kvm xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ses enclosure shpchp rtc_opal i2c_opal powernv_rng binfmt_misc dm_service_time scsi_dh_alua radeon i2c_algo_bit drm_kms_helper ttm drm tg3 ptp pps_core cxgb3 ipr i2c_core mdio dm_multipath [last unloaded: kvm_hv] [ 472.162111] CPU: 24 PID: 7039 Comm: qemu-system-ppc Not tainted 4.2.0-kvm+ #49 [ 472.162187] task: c000001e38512750 ti: c000001e41bfc000 task.ti: c000001e41bfc000 [ 472.162262] NIP: c00000000096b094 LR: c00000000096b08c CTR: c000000000111130 [ 472.162337] REGS: c000001e41bff520 TRAP: 0901 Not tainted (4.2.0-kvm+) [ 472.162399] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24848844 XER: 00000000 [ 472.162588] CFAR: c00000000096b0ac SOFTE: 1 GPR00: c000000000111170 c000001e41bff7a0 c00000000127df00 0000000000000001 GPR04: 0000000000000003 0000000000000001 0000000000000000 0000000000874821 GPR08: c000001e41bff8e0 0000000000000001 0000000000000000 d00000000efde740 GPR12: c000000000111130 c00000000fdae400 [ 472.163053] NIP [c00000000096b094] _raw_spin_lock_irqsave+0xa4/0x130 [ 472.163117] LR [c00000000096b08c] _raw_spin_lock_irqsave+0x9c/0x130 [ 472.163179] Call Trace: [ 472.163206] [c000001e41bff7a0] [c000001e41bff7f0] 0xc000001e41bff7f0 (unreliable) [ 472.163295] [c000001e41bff7e0] [c000000000111170] __wake_up+0x40/0x90 [ 472.163375] [c000001e41bff830] [d00000000efd6fc0] kvmppc_run_core+0x1240/0x1950 [kvm_hv] [ 472.163465] [c000001e41bffa30] [d00000000efd8510] kvmppc_vcpu_run_hv+0x5a0/0xd90 [kvm_hv] [ 472.163559] [c000001e41bffb70] [d00000000e9318a4] kvmppc_vcpu_run+0x44/0x60 [kvm] [ 472.163653] [c000001e41bffba0] [d00000000e92e674] kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm] [ 472.163745] [c000001e41bffbe0] [d00000000e9263a8] kvm_vcpu_ioctl+0x538/0x7b0 [kvm] [ 472.163834] [c000001e41bffd40] [c0000000002d0f50] do_vfs_ioctl+0x480/0x7c0 [ 472.163910] [c000001e41bffde0] [c0000000002d1364] SyS_ioctl+0xd4/0xf0 [ 472.163986] [c000001e41bffe30] [c000000000009260] system_call+0x38/0xd0 [ 472.164060] Instruction dump: [ 472.164098] ebc1fff0 ebe1fff8 7c0803a6 4e800020 60000000 60000000 60420000 8bad02e2 [ 472.164224] 7fc3f378 4b6a57c1 60000000 7c210b78 <e92d0000> 89290009 792affe3 40820070 The bug is that kvmppc_run_vcpu does not correctly handle the case where a vcpu task receives a signal while its guest vcpu is executing in the guest as a result of being piggy-backed onto the execution of another vcore. In that case we need to wait for the vcpu to finish executing inside the guest, and then remove this vcore from the preempted vcores list. That way, we avoid leaving this vcpu's vcore on the preempted vcores list when the vcpu gets interrupted. Fixes: ec2571650826 Reported-by: Thomas Huth <thuth@redhat.com> Tested-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
|
#
7f235328 |
|
02-Sep-2015 |
Gautham R. Shenoy <ego@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Fix race in starting secondary threads The current dynamic micro-threading code has a race due to which a secondary thread naps when it is supposed to be running a vcpu. As a side effect of this, on a guest exit, the primary thread in kvmppc_wait_for_nap() finds that this secondary thread hasn't cleared its vcore pointer. This results in "CPU X seems to be stuck!" warnings. The race is possible since the primary thread on exiting the guests only waits for all the secondaries to clear its vcore pointer. It subsequently expects the secondary threads to enter nap while it unsplits the core. A secondary thread which hasn't yet entered the nap will loop in kvm_no_guest until its vcore pointer and the do_nap flag are unset. Once the core has been unsplit, a new vcpu thread can grab the core and set the do_nap flag *before* setting the vcore pointers of the secondary. As a result, the secondary thread will now enter nap via kvm_unsplit_nap instead of running the guest vcpu. Fix this by setting the do_nap flag after setting the vcore pointer in the PACA of the secondary in kvmppc_run_core. Also, ensure that a secondary thread doesn't nap in kvm_unsplit_nap when the vcore pointer in its PACA struct is set. Fixes: b4deba5c41e9 Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
|
#
563a1e93 |
|
16-Jul-2015 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Fix preempted vcore stolen time calculation Whenever a vcore state is VCORE_PREEMPT we need to be counting stolen time for it. This currently isn't the case when we have a vcore that no longer has any runnable threads in it but still has a runner task, so we do an explicit call to kvmppc_core_start_stolen() in that case. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
402813fe |
|
16-Jul-2015 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Fix preempted vcore list locking When a vcore gets preempted, we put it on the preempted vcore list for the current CPU. The runner task then calls schedule() and comes back some time later and takes itself off the list. We need to be careful to lock the list that it was put onto, which may not be the list for the current CPU since the runner task may have moved to another CPU. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
b4deba5c |
|
02-Jul-2015 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8 This builds on the ability to run more than one vcore on a physical core by using the micro-threading (split-core) modes of the POWER8 chip. Previously, only vcores from the same VM could be run together, and (on POWER8) only if they had just one thread per core. With the ability to split the core on guest entry and unsplit it on guest exit, we can run up to 8 vcpu threads from up to 4 different VMs, and we can run multiple vcores with 2 or 4 vcpus per vcore. Dynamic micro-threading is only available if the static configuration of the cores is whole-core mode (unsplit), and only on POWER8. To manage this, we introduce a new kvm_split_mode struct which is shared across all of the subcores in the core, with a pointer in the paca on each thread. In addition we extend the core_info struct to have information on each subcore. When deciding whether to add a vcore to the set already on the core, we now have two possibilities: (a) piggyback the vcore onto an existing subcore, or (b) start a new subcore. Currently, when any vcpu needs to exit the guest and switch to host virtual mode, we interrupt all the threads in all subcores and switch the core back to whole-core mode. It may be possible in future to allow some of the subcores to keep executing in the guest while subcore 0 switches to the host, but that is not implemented in this patch. This adds a module parameter called dynamic_mt_modes which controls which micro-threading (split-core) modes the code will consider, as a bitmap. In other words, if it is 0, no micro-threading mode is considered; if it is 2, only 2-way micro-threading is considered; if it is 4, only 4-way, and if it is 6, both 2-way and 4-way micro-threading mode will be considered. The default is 6. With this, we now have secondary threads which are the primary thread for their subcore and therefore need to do the MMU switch. These threads will need to be started even if they have no vcpu to run, so we use the vcore pointer in the PACA rather than the vcpu pointer to trigger them. It is now possible for thread 0 to find that an exit has been requested before it gets to switch the subcore state to the guest. In that case we haven't added the guest's timebase offset to the timebase, so we need to be careful not to subtract the offset in the guest exit path. In fact we just skip the whole path that switches back to host context, since we haven't switched to the guest context. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
ec257165 |
|
24-Jun-2015 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Make use of unused threads when running guests When running a virtual core of a guest that is configured with fewer threads per core than the physical cores have, the extra physical threads are currently unused. This makes it possible to use them to run one or more other virtual cores from the same guest when certain conditions are met. This applies on POWER7, and on POWER8 to guests with one thread per virtual core. (It doesn't apply to POWER8 guests with multiple threads per vcore because they require a 1-1 virtual to physical thread mapping in order to be able to use msgsndp and the TIR.) The idea is that we maintain a list of preempted vcores for each physical cpu (i.e. each core, since the host runs single-threaded). Then, when a vcore is about to run, it checks to see if there are any vcores on the list for its physical cpu that could be piggybacked onto this vcore's execution. If so, those additional vcores are put into state VCORE_PIGGYBACK and their runnable VCPU threads are started as well as the original vcore, which is called the master vcore. After the vcores have exited the guest, the extra ones are put back onto the preempted list if any of their VCPUs are still runnable and not idle. This means that vcpu->arch.ptid is no longer necessarily the same as the physical thread that the vcpu runs on. In order to make it easier for code that wants to send an IPI to know which CPU to target, we now store that in a new field in struct vcpu_arch, called thread_cpu. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
5358a963 |
|
22-May-2015 |
Thomas Huth <thuth@redhat.com> |
KVM: PPC: Fix warnings from sparse When compiling the KVM code for POWER with "make C=1", sparse complains about functions missing proper prototypes and a 64-bit constant missing the ULL prefix. Let's fix this by making the functions static or by including the proper header with the prototypes, and by appending a ULL prefix to the constant PPC_MPPE_ADDRESS_MASK. Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
c56dadf3 |
|
14-Jul-2015 |
Konstantin Khlebnikov <koct9i@gmail.com> |
sched/preempt, powerpc, kvm: Use need_resched() instead of should_resched() Function should_resched() is equal to (!preempt_count() && need_resched()). In preemptive kernel preempt_count here is non-zero because of vc->lock. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Graf <agraf@suse.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150715095203.12246.72922.stgit@buzz Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
f36f3f28 |
|
18-May-2015 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: add "new" argument to kvm_arch_commit_memory_region This lets the function access the new memory slot without going through kvm_memslots and id_to_memslot. It will simplify the code when more than one address space will be supported. Unfortunately, the "const"ness of the new argument must be casted away in two places. Fixing KVM to accept const struct kvm_memory_slot pointers would require modifications in pretty much all architectures, and is left for later. Reviewed-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
09170a49 |
|
18-May-2015 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: const-ify uses of struct kvm_userspace_memory_region Architecture-specific helpers are not supposed to muck with struct kvm_userspace_memory_region contents. Add const to enforce this. In order to eliminate the only write in __kvm_set_memory_region, the cleaning of deleted slots is pulled up from update_memslots to __kvm_set_memory_region. Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Reviewed-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
9f6b8029 |
|
17-May-2015 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: use kvm_memslots whenever possible kvm_memslots provides lockdep checking. Use it consistently instead of explicit dereferencing of kvm->memslots. Reviewed-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
17d48901 |
|
28-Apr-2015 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Fix list traversal in error case This fixes a regression introduced in commit 25fedfca94cf, "KVM: PPC: Book3S HV: Move vcore preemption point up into kvmppc_run_vcpu", which leads to a user-triggerable oops. In the case where we try to run a vcore on a physical core that is not in single-threaded mode, or the vcore has too many threads for the physical core, we iterate the list of runnable vcpus to make each one return an EBUSY error to userspace. Since this involves taking each vcpu off the runnable_threads list for the vcore, we need to use list_for_each_entry_safe rather than list_for_each_entry to traverse the list. Otherwise the kernel will crash with an oops message like this: Unable to handle kernel paging request for data at address 0x000fff88 Faulting instruction address: 0xd00000001e635dc8 Oops: Kernel access of bad area, sig: 11 [#2] SMP NR_CPUS=1024 NUMA PowerNV ... CPU: 48 PID: 91256 Comm: qemu-system-ppc Tainted: G D 3.18.0 #1 task: c00000274e507500 ti: c0000027d1924000 task.ti: c0000027d1924000 NIP: d00000001e635dc8 LR: d00000001e635df8 CTR: c00000000011ba50 REGS: c0000027d19275b0 TRAP: 0300 Tainted: G D (3.18.0) MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 22002824 XER: 00000000 CFAR: c000000000008468 DAR: 00000000000fff88 DSISR: 40000000 SOFTE: 1 GPR00: d00000001e635df8 c0000027d1927830 d00000001e64c850 0000000000000001 GPR04: 0000000000000001 0000000000000001 0000000000000000 0000000000000000 GPR08: 0000000000200200 0000000000000000 0000000000000000 d00000001e63e588 GPR12: 0000000000002200 c000000007dbc800 c000000fc7800000 000000000000000a GPR16: fffffffffffffffc c000000fd5439690 c000000fc7801c98 0000000000000001 GPR20: 0000000000000003 c0000027d1927aa8 c000000fd543b348 c000000fd543b350 GPR24: 0000000000000000 c000000fa57f0000 0000000000000030 0000000000000000 GPR28: fffffffffffffff0 c000000fd543b328 00000000000fe468 c000000fd543b300 NIP [d00000001e635dc8] kvmppc_run_core+0x198/0x17c0 [kvm_hv] LR [d00000001e635df8] kvmppc_run_core+0x1c8/0x17c0 [kvm_hv] Call Trace: [c0000027d1927830] [d00000001e635df8] kvmppc_run_core+0x1c8/0x17c0 [kvm_hv] (unreliable) [c0000027d1927a30] [d00000001e638350] kvmppc_vcpu_run_hv+0x5b0/0xdd0 [kvm_hv] [c0000027d1927b70] [d00000001e510504] kvmppc_vcpu_run+0x44/0x60 [kvm] [c0000027d1927ba0] [d00000001e50d4a4] kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm] [c0000027d1927be0] [d00000001e504be8] kvm_vcpu_ioctl+0x5e8/0x7a0 [kvm] [c0000027d1927d40] [c0000000002d6720] do_vfs_ioctl+0x490/0x780 [c0000027d1927de0] [c0000000002d6ae4] SyS_ioctl+0xd4/0xf0 [c0000027d1927e30] [c000000000009358] syscall_exit+0x0/0x98 Instruction dump: 60000000 60420000 387e1b30 38800003 38a00001 38c00000 480087d9 e8410018 ebde1c98 7fbdf040 3bdee368 419e0048 <813e1b20> 939e1b18 2f890001 409effcc ---[ end trace 8cdf50251cca6680 ]--- Fixes: 25fedfca94cfbf2461314c6c34ef58e74a31b025 Signed-off-by: Paul Mackerras <paulus@samba.org> Reviewed-by: Alexander Graf <agraf@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
66feed61 |
|
27-Mar-2015 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Use msgsnd for signalling threads on POWER8 This uses msgsnd where possible for signalling other threads within the same core on POWER8 systems, rather than IPIs through the XICS interrupt controller. This includes waking secondary threads to run the guest, the interrupts generated by the virtual XICS, and the interrupts to bring the other threads out of the guest when exiting. Aggregated statistics from debugfs across vcpus for a guest with 32 vcpus, 8 threads/vcore, running on a POWER8, show this before the change: rm_entry: 3387.6ns (228 - 86600, 1008969 samples) rm_exit: 4561.5ns (12 - 3477452, 1009402 samples) rm_intr: 1660.0ns (12 - 553050, 3600051 samples) and this after the change: rm_entry: 3060.1ns (212 - 65138, 953873 samples) rm_exit: 4244.1ns (12 - 9693408, 954331 samples) rm_intr: 1342.3ns (12 - 1104718, 3405326 samples) for a test of booting Fedora 20 big-endian to the login prompt. The time taken for a H_PROD hcall (which is handled in the host kernel) went down from about 35 microseconds to about 16 microseconds with this change. The noinline added to kvmppc_run_core turned out to be necessary for good performance, at least with gcc 4.9.2 as packaged with Fedora 21 and a little-endian POWER8 host. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
7d6c40da |
|
27-Mar-2015 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Use bitmap of active threads rather than count Currently, the entry_exit_count field in the kvmppc_vcore struct contains two 8-bit counts, one of the threads that have started entering the guest, and one of the threads that have started exiting the guest. This changes it to an entry_exit_map field which contains two bitmaps of 8 bits each. The advantage of doing this is that it gives us a bitmap of which threads need to be signalled when exiting the guest. That means that we no longer need to use the trick of setting the HDEC to 0 to pull the other threads out of the guest, which led in some cases to a spurious HDEC interrupt on the next guest entry. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
5d5b99cd |
|
27-Mar-2015 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Get rid of vcore nap_count and n_woken We can tell when a secondary thread has finished running a guest by the fact that it clears its kvm_hstate.kvm_vcpu pointer, so there is no real need for the nap_count field in the kvmppc_vcore struct. This changes kvmppc_wait_for_nap to poll the kvm_hstate.kvm_vcpu pointers of the secondary threads rather than polling vc->nap_count. Besides reducing the size of the kvmppc_vcore struct by 8 bytes, this also means that we can tell which secondary threads have got stuck and thus print a more informative error message. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
25fedfca |
|
27-Mar-2015 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Move vcore preemption point up into kvmppc_run_vcpu Rather than calling cond_resched() in kvmppc_run_core() before doing the post-processing for the vcpus that we have just run (that is, calling kvmppc_handle_exit_hv(), kvmppc_set_timer(), etc.), we now do that post-processing before calling cond_resched(), and that post- processing is moved out into its own function, post_guest_process(). The reschedule point is now in kvmppc_run_vcpu() and we define a new vcore state, VCORE_PREEMPT, to indicate that that the vcore's runner task is runnable but not running. (Doing the reschedule with the vcore in VCORE_INACTIVE state would be bad because there are potentially other vcpus waiting for the runner in kvmppc_wait_for_exec() which then wouldn't get woken up.) Also, we make use of the handy cond_resched_lock() function, which unlocks and relocks vc->lock for us around the reschedule. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
d911f0be |
|
27-Mar-2015 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Simplify handling of VCPUs that need a VPA update Previously, if kvmppc_run_core() was running a VCPU that needed a VPA update (i.e. one of its 3 virtual processor areas needed to be pinned in memory so the host real mode code can update it on guest entry and exit), we would drop the vcore lock and do the update there and then. Future changes will make it inconvenient to drop the lock, so instead we now remove it from the list of runnable VCPUs and wake up its VCPU task. This will have the effect that the VCPU task will exit kvmppc_run_vcpu(), go around the do loop in kvmppc_vcpu_run_hv(), and re-enter kvmppc_run_vcpu(), whereupon it will do the necessary call to kvmppc_update_vpas() and then rejoin the vcore. The one complication is that the runner VCPU (whose VCPU task is the current task) might be one of the ones that gets removed from the runnable list. In that case we just return from kvmppc_run_core() and let the code in kvmppc_run_vcpu() wake up another VCPU task to be the runner if necessary. This all means that the VCORE_STARTING state is no longer used, so we remove it. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
b6c295df |
|
27-Mar-2015 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Accumulate timing information for real-mode code This reads the timebase at various points in the real-mode guest entry/exit code and uses that to accumulate total, minimum and maximum time spent in those parts of the code. Currently these times are accumulated per vcpu in 5 parts of the code: * rm_entry - time taken from the start of kvmppc_hv_entry() until just before entering the guest. * rm_intr - time from when we take a hypervisor interrupt in the guest until we either re-enter the guest or decide to exit to the host. This includes time spent handling hcalls in real mode. * rm_exit - time from when we decide to exit the guest until the return from kvmppc_hv_entry(). * guest - time spend in the guest * cede - time spent napping in real mode due to an H_CEDE hcall while other threads in the same vcore are active. These times are exposed in debugfs in a directory per vcpu that contains a file called "timings". This file contains one line for each of the 5 timings above, with the name followed by a colon and 4 numbers, which are the count (number of times the code has been executed), the total time, the minimum time, and the maximum time, all in nanoseconds. The overhead of the extra code amounts to about 30ns for an hcall that is handled in real mode (e.g. H_SET_DABR), which is about 25%. Since production environments may not wish to incur this overhead, the new code is conditional on a new config symbol, CONFIG_KVM_BOOK3S_HV_EXIT_TIMING. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
e23a808b |
|
27-Mar-2015 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Create debugfs file for each guest's HPT This creates a debugfs directory for each HV guest (assuming debugfs is enabled in the kernel config), and within that directory, a file by which the contents of the guest's HPT (hashed page table) can be read. The directory is named vmnnnn, where nnnn is the PID of the process that created the guest. The file is named "htab". This is intended to help in debugging problems in the host's management of guest memory. The contents of the file consist of a series of lines like this: 3f48 4000d032bf003505 0000000bd7ff1196 00000003b5c71196 The first field is the index of the entry in the HPT, the second and third are the HPT entry, so the third entry contains the real page number that is mapped by the entry if the entry's valid bit is set. The fourth field is the guest's view of the second doubleword of the entry, so it contains the guest physical address. (The format of the second through fourth fields are described in the Power ISA and also in arch/powerpc/include/asm/mmu-hash64.h.) Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
31037eca |
|
20-Mar-2015 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Remove RMA-related variables from code We don't support real-mode areas now that 970 support is removed. Remove the remaining details of rma from the code. Also rename rma_setup_done to hpte_setup_done to better reflect the changes. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
99342cf8 |
|
04-Feb-2015 |
David Gibson <david@gibson.dropbear.id.au> |
kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM On POWER, storage caching is usually configured via the MMU - attributes such as cache-inhibited are stored in the TLB and the hashed page table. This makes correctly performing cache inhibited IO accesses awkward when the MMU is turned off (real mode). Some CPU models provide special registers to control the cache attributes of real mode load and stores but this is not at all consistent. This is a problem in particular for SLOF, the firmware used on KVM guests, which runs entirely in real mode, but which needs to do IO to load the kernel. To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to a logical address (aka guest physical address). SLOF uses these for IO. However, because these are implemented within qemu, not the host kernel, these bypass any IO devices emulated within KVM itself. The simplest way to see this problem is to attempt to boot a KVM guest from a virtio-blk device with iothread / dataplane enabled. The iothread code relies on an in kernel implementation of the virtio queue notification, which is not triggered by the IO hcalls, and so the guest will stall in SLOF unable to load the guest OS. This patch addresses this by providing in-kernel implementations of the 2 hypercalls, which correctly scan the KVM IO bus. Any access to an address not handled by the KVM IO bus will cause a VM exit, hitting the qemu implementation as before. Note that a userspace change is also required, in order to enable these new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [agraf: fix compilation] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
ecb6d618 |
|
20-Mar-2015 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Endian fix for accessing VPA yield count The VPA (virtual processor area) is defined by PAPR and is therefore big-endian, so we need a be32_to_cpu when reading it in kvmppc_get_yield_count(). Without this, H_CONFER always fails on a little-endian host, causing SMP guests to waste time spinning on spinlocks. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
8f902b00 |
|
20-Mar-2015 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Fix spinlock/mutex ordering issue in kvmppc_set_lpcr() Currently, kvmppc_set_lpcr() has a spinlock around the whole function, and inside that does mutex_lock(&kvm->lock). It is not permitted to take a mutex while holding a spinlock, because the mutex_lock might call schedule(). In addition, this causes lockdep to warn about a lock ordering issue: ====================================================== [ INFO: possible circular locking dependency detected ] 3.18.0-kvm-04645-gdfea862-dirty #131 Not tainted ------------------------------------------------------- qemu-system-ppc/8179 is trying to acquire lock: (&kvm->lock){+.+.+.}, at: [<d00000000ecc1f54>] .kvmppc_set_lpcr+0xf4/0x1c0 [kvm_hv] but task is already holding lock: (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecc1ea0>] .kvmppc_set_lpcr+0x40/0x1c0 [kvm_hv] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&(&vcore->lock)->rlock){+.+...}: [<c000000000b3c120>] .mutex_lock_nested+0x80/0x570 [<d00000000ecc7a14>] .kvmppc_vcpu_run_hv+0xc4/0xe40 [kvm_hv] [<d00000000eb9f5cc>] .kvmppc_vcpu_run+0x2c/0x40 [kvm] [<d00000000eb9cb24>] .kvm_arch_vcpu_ioctl_run+0x54/0x160 [kvm] [<d00000000eb94478>] .kvm_vcpu_ioctl+0x4a8/0x7b0 [kvm] [<c00000000026cbb4>] .do_vfs_ioctl+0x444/0x770 [<c00000000026cfa4>] .SyS_ioctl+0xc4/0xe0 [<c000000000009264>] syscall_exit+0x0/0x98 -> #0 (&kvm->lock){+.+.+.}: [<c0000000000ff28c>] .lock_acquire+0xcc/0x1a0 [<c000000000b3c120>] .mutex_lock_nested+0x80/0x570 [<d00000000ecc1f54>] .kvmppc_set_lpcr+0xf4/0x1c0 [kvm_hv] [<d00000000ecc510c>] .kvmppc_set_one_reg_hv+0x4dc/0x990 [kvm_hv] [<d00000000eb9f234>] .kvmppc_set_one_reg+0x44/0x330 [kvm] [<d00000000eb9c9dc>] .kvm_vcpu_ioctl_set_one_reg+0x5c/0x150 [kvm] [<d00000000eb9ced4>] .kvm_arch_vcpu_ioctl+0x214/0x2c0 [kvm] [<d00000000eb940b0>] .kvm_vcpu_ioctl+0xe0/0x7b0 [kvm] [<c00000000026cbb4>] .do_vfs_ioctl+0x444/0x770 [<c00000000026cfa4>] .SyS_ioctl+0xc4/0xe0 [<c000000000009264>] syscall_exit+0x0/0x98 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&(&vcore->lock)->rlock); lock(&kvm->lock); lock(&(&vcore->lock)->rlock); lock(&kvm->lock); *** DEADLOCK *** 2 locks held by qemu-system-ppc/8179: #0: (&vcpu->mutex){+.+.+.}, at: [<d00000000eb93f18>] .vcpu_load+0x28/0x90 [kvm] #1: (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecc1ea0>] .kvmppc_set_lpcr+0x40/0x1c0 [kvm_hv] stack backtrace: CPU: 4 PID: 8179 Comm: qemu-system-ppc Not tainted 3.18.0-kvm-04645-gdfea862-dirty #131 Call Trace: [c000001a66c0f310] [c000000000b486ac] .dump_stack+0x88/0xb4 (unreliable) [c000001a66c0f390] [c0000000000f8bec] .print_circular_bug+0x27c/0x3d0 [c000001a66c0f440] [c0000000000fe9e8] .__lock_acquire+0x2028/0x2190 [c000001a66c0f5d0] [c0000000000ff28c] .lock_acquire+0xcc/0x1a0 [c000001a66c0f6a0] [c000000000b3c120] .mutex_lock_nested+0x80/0x570 [c000001a66c0f7c0] [d00000000ecc1f54] .kvmppc_set_lpcr+0xf4/0x1c0 [kvm_hv] [c000001a66c0f860] [d00000000ecc510c] .kvmppc_set_one_reg_hv+0x4dc/0x990 [kvm_hv] [c000001a66c0f8d0] [d00000000eb9f234] .kvmppc_set_one_reg+0x44/0x330 [kvm] [c000001a66c0f960] [d00000000eb9c9dc] .kvm_vcpu_ioctl_set_one_reg+0x5c/0x150 [kvm] [c000001a66c0f9f0] [d00000000eb9ced4] .kvm_arch_vcpu_ioctl+0x214/0x2c0 [kvm] [c000001a66c0faf0] [d00000000eb940b0] .kvm_vcpu_ioctl+0xe0/0x7b0 [kvm] [c000001a66c0fcb0] [c00000000026cbb4] .do_vfs_ioctl+0x444/0x770 [c000001a66c0fd90] [c00000000026cfa4] .SyS_ioctl+0xc4/0xe0 [c000001a66c0fe30] [c000000000009264] syscall_exit+0x0/0x98 This fixes it by moving the mutex_lock()/mutex_unlock() pair outside the spin-locked region. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
90fd09f8 |
|
02-Dec-2014 |
Sam Bobroff <sam.bobroff@au1.ibm.com> |
KVM: PPC: Book3S HV: Improve H_CONFER implementation Currently the H_CONFER hcall is implemented in kernel virtual mode, meaning that whenever a guest thread does an H_CONFER, all the threads in that virtual core have to exit the guest. This is bad for performance because it interrupts the other threads even if they are doing useful work. The H_CONFER hcall is called by a guest VCPU when it is spinning on a spinlock and it detects that the spinlock is held by a guest VCPU that is currently not running on a physical CPU. The idea is to give this VCPU's time slice to the holder VCPU so that it can make progress towards releasing the lock. To avoid having the other threads exit the guest unnecessarily, we add a real-mode implementation of H_CONFER that checks whether the other threads are doing anything. If all the other threads are idle (i.e. in H_CEDE) or trying to confer (i.e. in H_CONFER), it returns H_TOO_HARD which causes a guest exit and allows the H_CONFER to be handled in virtual mode. Otherwise it spins for a short time (up to 10 microseconds) to give other threads the chance to observe that this thread is trying to confer. The spin loop also terminates when any thread exits the guest or when all other threads are idle or trying to confer. If the timeout is reached, the H_CONFER returns H_SUCCESS. In this case the guest VCPU will recheck the spinlock word and most likely call H_CONFER again. This also improves the implementation of the H_CONFER virtual mode handler. If the VCPU is part of a virtual core (vcore) which is runnable, there will be a 'runner' VCPU which has taken responsibility for running the vcore. In this case we yield to the runner VCPU rather than the target VCPU. We also introduce a check on the target VCPU's yield count: if it differs from the yield count passed to H_CONFER, the target VCPU has run since H_CONFER was called and may have already released the lock. This check is required by PAPR. Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
4a157d61 |
|
02-Dec-2014 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register There are two ways in which a guest instruction can be obtained from the guest in the guest exit code in book3s_hv_rmhandlers.S. If the exit was caused by a Hypervisor Emulation interrupt (i.e. an illegal instruction), the offending instruction is in the HEIR register (Hypervisor Emulation Instruction Register). If the exit was caused by a load or store to an emulated MMIO device, we load the instruction from the guest by turning data relocation on and loading the instruction with an lwz instruction. Unfortunately, in the case where the guest has opposite endianness to the host, these two methods give results of different endianness, but both get put into vcpu->arch.last_inst. The HEIR value has been loaded using guest endianness, whereas the lwz will load the instruction using host endianness. The rest of the code that uses vcpu->arch.last_inst assumes it was loaded using host endianness. To fix this, we define a new vcpu field to store the HEIR value. Then, in kvmppc_handle_exit_hv(), we transfer the value from this new field to vcpu->arch.last_inst, doing a byte-swap if the guest and host endianness differ. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
c17b98cf |
|
02-Dec-2014 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Remove code for PPC970 processors This removes the code that was added to enable HV KVM to work on PPC970 processors. The PPC970 is an old CPU that doesn't support virtualizing guest memory. Removing PPC970 support also lets us remove the code for allocating and managing contiguous real-mode areas, the code for the !kvm->arch.using_mmu_notifiers case, the code for pinning pages of guest memory when first accessed and keeping track of which pages have been pinned, and the code for handling H_ENTER hypercalls in virtual mode. Book3S HV KVM is now supported only on POWER7 and POWER8 processors. The KVM_CAP_PPC_RMA capability now always returns 0. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
3c78f78a |
|
03-Dec-2014 |
Suresh E. Warrier <warrier@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Tracepoints for KVM HV guest interactions This patch adds trace points in the guest entry and exit code and also for exceptions handled by the host in kernel mode - hypercalls and page faults. The new events are added to /sys/kernel/debug/tracing/events under a new subsystem called kvm_hv. Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
2711e248 |
|
03-Dec-2014 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Simplify locking around stolen time calculations Currently the calculations of stolen time for PPC Book3S HV guests uses fields in both the vcpu struct and the kvmppc_vcore struct. The fields in the kvmppc_vcore struct are protected by the vcpu->arch.tbacct_lock of the vcpu that has taken responsibility for running the virtual core. This works correctly but confuses lockdep, because it sees that the code takes the tbacct_lock for a vcpu in kvmppc_remove_runnable() and then takes another vcpu's tbacct_lock in vcore_stolen_time(), and it thinks there is a possibility of deadlock, causing it to print reports like this: ============================================= [ INFO: possible recursive locking detected ] 3.18.0-rc7-kvm-00016-g8db4bc6 #89 Not tainted --------------------------------------------- qemu-system-ppc/6188 is trying to acquire lock: (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb1fe8>] .vcore_stolen_time+0x48/0xd0 [kvm_hv] but task is already holding lock: (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv] other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&vcpu->arch.tbacct_lock)->rlock); lock(&(&vcpu->arch.tbacct_lock)->rlock); *** DEADLOCK *** May be due to missing lock nesting notation 3 locks held by qemu-system-ppc/6188: #0: (&vcpu->mutex){+.+.+.}, at: [<d00000000eb93f98>] .vcpu_load+0x28/0xe0 [kvm] #1: (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecb41b0>] .kvmppc_vcpu_run_hv+0x530/0x1530 [kvm_hv] #2: (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv] stack backtrace: CPU: 40 PID: 6188 Comm: qemu-system-ppc Not tainted 3.18.0-rc7-kvm-00016-g8db4bc6 #89 Call Trace: [c000000b2754f3f0] [c000000000b31b6c] .dump_stack+0x88/0xb4 (unreliable) [c000000b2754f470] [c0000000000faeb8] .__lock_acquire+0x1878/0x2190 [c000000b2754f600] [c0000000000fbf0c] .lock_acquire+0xcc/0x1a0 [c000000b2754f6d0] [c000000000b2954c] ._raw_spin_lock_irq+0x4c/0x70 [c000000b2754f760] [d00000000ecb1fe8] .vcore_stolen_time+0x48/0xd0 [kvm_hv] [c000000b2754f7f0] [d00000000ecb25b4] .kvmppc_remove_runnable.part.3+0x44/0xd0 [kvm_hv] [c000000b2754f880] [d00000000ecb43ec] .kvmppc_vcpu_run_hv+0x76c/0x1530 [kvm_hv] [c000000b2754f9f0] [d00000000eb9f46c] .kvmppc_vcpu_run+0x2c/0x40 [kvm] [c000000b2754fa60] [d00000000eb9c9a4] .kvm_arch_vcpu_ioctl_run+0x54/0x160 [kvm] [c000000b2754faf0] [d00000000eb94538] .kvm_vcpu_ioctl+0x498/0x760 [kvm] [c000000b2754fcb0] [c000000000267eb4] .do_vfs_ioctl+0x444/0x770 [c000000b2754fd90] [c0000000002682a4] .SyS_ioctl+0xc4/0xe0 [c000000b2754fe30] [c0000000000092e4] syscall_exit+0x0/0x98 In order to make the locking easier to analyse, we change the code to use a spinlock in the kvmppc_vcore struct to protect the stolen_tb and preempt_tb fields. This lock needs to be an irq-safe lock since it is used in the kvmppc_core_vcpu_load_hv() and kvmppc_core_vcpu_put_hv() functions, which are called with the scheduler rq lock held, which is an irq-safe lock. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
1bc5d59c |
|
02-Nov-2014 |
Suresh E. Warrier <warrier@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Check wait conditions before sleeping in kvmppc_vcore_blocked The kvmppc_vcore_blocked() code does not check for the wait condition after putting the process on the wait queue. This means that it is possible for an external interrupt to become pending, but the vcpu to remain asleep until the next decrementer interrupt. The fix is to make one last check for pending exceptions and ceded state before calling schedule(). Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
dee6f24c |
|
02-Nov-2014 |
Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> |
KVM: PPC: Book3S HV: Fix an issue where guest is paused on receiving HMI When we get an HMI (hypervisor maintenance interrupt) while in a guest, we see that guest enters into paused state. The reason is, in kvmppc_handle_exit_hv it falls through default path and returns to host instead of resuming guest. This causes guest to enter into paused state. HMI is a hypervisor only interrupt and it is safe to resume the guest since the host has handled it already. This patch adds a switch case to resume the guest. Without this patch we see guest entering into paused state with following console messages: [ 3003.329351] Severe Hypervisor Maintenance interrupt [Recovered] [ 3003.329356] Error detail: Timer facility experienced an error [ 3003.329359] HMER: 0840000000000000 [ 3003.329360] TFMR: 4a12000980a84000 [ 3003.329366] vcpu c0000007c35094c0 (40): [ 3003.329368] pc = c0000000000c2ba0 msr = 8000000000009032 trap = e60 [ 3003.329370] r 0 = c00000000021ddc0 r16 = 0000000000000046 [ 3003.329372] r 1 = c00000007a02bbd0 r17 = 00003ffff27d5d98 [ 3003.329375] r 2 = c0000000010980b8 r18 = 00001fffffc9a0b0 [ 3003.329377] r 3 = c00000000142d6b8 r19 = c00000000142d6b8 [ 3003.329379] r 4 = 0000000000000002 r20 = 0000000000000000 [ 3003.329381] r 5 = c00000000524a110 r21 = 0000000000000000 [ 3003.329383] r 6 = 0000000000000001 r22 = 0000000000000000 [ 3003.329386] r 7 = 0000000000000000 r23 = c00000000524a110 [ 3003.329388] r 8 = 0000000000000000 r24 = 0000000000000001 [ 3003.329391] r 9 = 0000000000000001 r25 = c00000007c31da38 [ 3003.329393] r10 = c0000000014280b8 r26 = 0000000000000002 [ 3003.329395] r11 = 746f6f6c2f68656c r27 = c00000000524a110 [ 3003.329397] r12 = 0000000028004484 r28 = c00000007c31da38 [ 3003.329399] r13 = c00000000fe01400 r29 = 0000000000000002 [ 3003.329401] r14 = 0000000000000046 r30 = c000000003011e00 [ 3003.329403] r15 = ffffffffffffffba r31 = 0000000000000002 [ 3003.329404] ctr = c00000000041a670 lr = c000000000272520 [ 3003.329405] srr0 = c00000000007e8d8 srr1 = 9000000000001002 [ 3003.329406] sprg0 = 0000000000000000 sprg1 = c00000000fe01400 [ 3003.329407] sprg2 = c00000000fe01400 sprg3 = 0000000000000005 [ 3003.329408] cr = 48004482 xer = 2000000000000000 dsisr = 42000000 [ 3003.329409] dar = 0000010015020048 [ 3003.329410] fault dar = 0000010015020048 dsisr = 42000000 [ 3003.329411] SLB (8 entries): [ 3003.329412] ESID = c000000008000000 VSID = 40016e7779000510 [ 3003.329413] ESID = d000000008000001 VSID = 400142add1000510 [ 3003.329414] ESID = f000000008000004 VSID = 4000eb1a81000510 [ 3003.329415] ESID = 00001f000800000b VSID = 40004fda0a000d90 [ 3003.329416] ESID = 00003f000800000c VSID = 400039f536000d90 [ 3003.329417] ESID = 000000001800000d VSID = 0001251b35150d90 [ 3003.329417] ESID = 000001000800000e VSID = 4001e46090000d90 [ 3003.329418] ESID = d000080008000019 VSID = 40013d349c000400 [ 3003.329419] lpcr = c048800001847001 sdr1 = 0000001b19000006 last_inst = ffffffff [ 3003.329421] trap=0xe60 | pc=0xc0000000000c2ba0 | msr=0x8000000000009032 [ 3003.329524] Severe Hypervisor Maintenance interrupt [Recovered] [ 3003.329526] Error detail: Timer facility experienced an error [ 3003.329527] HMER: 0840000000000000 [ 3003.329527] TFMR: 4a12000980a94000 [ 3006.359786] Severe Hypervisor Maintenance interrupt [Recovered] [ 3006.359792] Error detail: Timer facility experienced an error [ 3006.359795] HMER: 0840000000000000 [ 3006.359797] TFMR: 4a12000980a84000 Id Name State ---------------------------------------------------- 2 guest2 running 3 guest3 paused 4 guest4 running Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
a59c1d9e |
|
09-Sep-2014 |
Madhavan Srinivasan <maddy@linux.vnet.ibm.com> |
powerpc/kvm: support to handle sw breakpoint This patch adds kernel side support for software breakpoint. Design is that, by using an illegal instruction, we trap to hypervisor via Emulation Assistance interrupt, where we check for the illegal instruction and accordingly we return to Host or Guest. Patch also adds support for software breakpoint in PR KVM. Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
9333e6c4 |
|
02-Sep-2014 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Only accept host PVR value for guest PVR Since the guest can read the machine's PVR (Processor Version Register) directly and see the real value, we should disallow userspace from setting any value for the guest's PVR other than the real host value. Therefore this makes kvm_arch_vcpu_set_sregs_hv() check the supplied PVR value and return an error if it is different from the host value, which has been put into vcpu->arch.pvr at vcpu creation time. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
b754c739 |
|
02-Sep-2014 |
Paul Mackerras <paulus@au1.ibm.com> |
KVM: PPC: Book3S HV: Increase timeout for grabbing secondary threads Occasional failures have been seen with split-core mode and migration where the message "KVM: couldn't grab cpu" appears. This increases the length of time that we wait from 1ms to 10ms, which seems to work around the issue. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
63fff5c1 |
|
29-Jun-2014 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
KVM: PPC: BOOK3S: HV: Update compute_tlbie_rb to handle 16MB base page When calculating the lower bits of AVA field, use the shift count based on the base page size. Also add the missing segment size and remove stale comment. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
9678cdaa |
|
17-Jul-2014 |
Stewart Smith <stewart@linux.vnet.ibm.com> |
Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8 The POWER8 processor has a Micro Partition Prefetch Engine, which is a fancy way of saying "has way to store and load contents of L2 or L2+MRU way of L3 cache". We initiate the storing of the log (list of addresses) using the logmpp instruction and start restore by writing to a SPR. The logmpp instruction takes parameters in a single 64bit register: - starting address of the table to store log of L2/L2+L3 cache contents - 32kb for L2 - 128kb for L2+L3 - Aligned relative to maximum size of the table (32kb or 128kb) - Log control (no-op, L2 only, L2 and L3, abort logout) We should abort any ongoing logging before initiating one. To initiate restore, we write to the MPPR SPR. The format of what to write to the SPR is similar to the logmpp instruction parameter: - starting address of the table to read from (same alignment requirements) - table size (no data, until end of table) - prefetch rate (from fastest possible to slower. about every 8, 16, 24 or 32 cycles) The idea behind loading and storing the contents of L2/L3 cache is to reduce memory latency in a system that is frequently swapping vcores on a physical CPU. The best case scenario for doing this is when some vcores are doing very cache heavy workloads. The worst case is when they have about 0 cache hits, so we just generate needless memory operations. This implementation just does L2 store/load. In my benchmarks this proves to be useful. Benchmark 1: - 16 core POWER8 - 3x Ubuntu 14.04LTS guests (LE) with 8 VCPUs each - No split core/SMT - two guests running sysbench memory test. sysbench --test=memory --num-threads=8 run - one guest running apache bench (of default HTML page) ab -n 490000 -c 400 http://localhost/ This benchmark aims to measure performance of real world application (apache) where other guests are cache hot with their own workloads. The sysbench memory benchmark does pointer sized writes to a (small) memory buffer in a loop. In this benchmark with this patch I can see an improvement both in requests per second (~5%) and in mean and median response times (again, about 5%). The spread of minimum and maximum response times were largely unchanged. benchmark 2: - Same VM config as benchmark 1 - all three guests running sysbench memory benchmark This benchmark aims to see if there is a positive or negative affect to this cache heavy benchmark. Although due to the nature of the benchmark (stores) we may not see a difference in performance, but rather hopefully an improvement in consistency of performance (when vcore switched in, don't have to wait many times for cachelines to be pulled in) The results of this benchmark are improvements in consistency of performance rather than performance itself. With this patch, the few outliers in duration go away and we get more consistent performance in each guest. benchmark 3: - same 3 guests and CPU configuration as benchmark 1 and 2. - two idle guests - 1 guest running STREAM benchmark This scenario also saw performance improvement with this patch. On Copy and Scale workloads from STREAM, I got 5-6% improvement with this patch. For Add and triad, it was around 10% (or more). benchmark 4: - same 3 guests as previous benchmarks - two guests running sysbench --memory, distinctly different cache heavy workload - one guest running STREAM benchmark. Similar improvements to benchmark 3. benchmark 5: - 1 guest, 8 VCPUs, Ubuntu 14.04 - Host configured with split core (SMT8, subcores-per-core=4) - STREAM benchmark In this benchmark, we see a 10-20% performance improvement across the board of STREAM benchmark results with this patch. Based on preliminary investigation and microbenchmarks by Prerna Saxena <prerna@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
de9bdd1a |
|
17-Jul-2014 |
Stewart Smith <stewart@linux.vnet.ibm.com> |
Split out struct kvmppc_vcore creation to separate function No code changes, just split it out to a function so that with the addition of micro partition prefetch buffer allocation (in subsequent patch) looks neater and doesn't require excessive indentation. Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
a0840240 |
|
19-Jul-2014 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
KVM: PPC: Book3S: Fix LPCR one_reg interface Unfortunately, the LPCR got defined as a 32-bit register in the one_reg interface. This is unfortunate because KVM allows userspace to control the DPFD (default prefetch depth) field, which is in the upper 32 bits. The result is that DPFD always get set to 0, which reduces performance in the guest. We can't just change KVM_REG_PPC_LPCR to be a 64-bit register ID, since that would break existing userspace binaries. Instead we define a new KVM_REG_PPC_LPCR_64 id which is 64-bit. Userspace can still use the old KVM_REG_PPC_LPCR id, but it now only modifies those fields in the bottom 32 bits that userspace can modify (ILE, TC and AIL). If userspace uses the new KVM_REG_PPC_LPCR_64 id, it can modify DPFD as well. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: stable@vger.kernel.org Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
02407552 |
|
11-Jun-2014 |
Alexander Graf <agraf@suse.de> |
KVM: PPC: Book3S HV: Access guest VPA in BE There are a few shared data structures between the host and the guest. Most of them get registered through the VPA interface. These data structures are defined to always be in big endian byte order, so let's make sure we always access them in big endian. Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
9642382e |
|
01-Jun-2014 |
Michael Neuling <mikey@neuling.org> |
KVM: PPC: Book3S HV: Add H_SET_MODE hcall handling This adds support for the H_SET_MODE hcall. This hcall is a multiplexer that has several functions, some of which are called rarely, and some which are potentially called very frequently. Here we add support for the functions that set the debug registers CIABR (Completed Instruction Address Breakpoint Register) and DAWR/DAWRX (Data Address Watchpoint Register and eXtension), since they could be updated by the guest as often as every context switch. This also adds a kvmppc_power8_compatible() function to test to see if a guest is compatible with POWER8 or not. The CIABR and DAWR/X only exist on POWER8. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
ae2113a4 |
|
01-Jun-2014 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled This adds code to check that when the KVM_CAP_PPC_ENABLE_HCALL capability is used to enable or disable in-kernel handling of an hcall, that the hcall is actually implemented by the kernel. If not an EINVAL error is returned. This also checks the default-enabled list of hcalls and prints a warning if any hcall there is not actually implemented. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
699a0ea0 |
|
01-Jun-2014 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S: Controls for in-kernel sPAPR hypercall handling This provides a way for userspace controls which sPAPR hcalls get handled in the kernel. Each hcall can be individually enabled or disabled for in-kernel handling, except for H_RTAS. The exception for H_RTAS is because userspace can already control whether individual RTAS functions are handled in-kernel or not via the KVM_PPC_RTAS_DEFINE_TOKEN ioctl, and because the numeric value for H_RTAS is out of the normal sequence of hcall numbers. Hcalls are enabled or disabled using the KVM_ENABLE_CAP ioctl for the KVM_CAP_PPC_ENABLE_HCALL capability on the file descriptor for the VM. The args field of the struct kvm_enable_cap specifies the hcall number in args[0] and the enable/disable flag in args[1]; 0 means disable in-kernel handling (so that the hcall will always cause an exit to userspace) and 1 means enable. Enabling or disabling in-kernel handling of an hcall is effective across the whole VM. The ability for KVM_ENABLE_CAP to be used on a VM file descriptor on PowerPC is new, added by this commit. The KVM_CAP_ENABLE_CAP_VM capability advertises that this ability exists. When a VM is created, an initial set of hcalls are enabled for in-kernel handling. The set that is enabled is the set that have an in-kernel implementation at this point. Any new hcall implementations from this point onwards should not be added to the default set without a good reason. No distinction is made between real-mode and virtual-mode hcall implementations; the one setting controls them both. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
06da28e7 |
|
05-Jun-2014 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
KVM: PPC: BOOK3S: PR: Emulate instruction counter Writing to IC is not allowed in the privileged mode. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
8f42ab27 |
|
05-Jun-2014 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
KVM: PPC: BOOK3S: PR: Emulate virtual timebase register virtual time base register is a per VM, per cpu register that needs to be saved and restored on vm exit and entry. Writing to VTB is not allowed in the privileged mode. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [agraf: fix compile error] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
1f365bb0 |
|
06-May-2014 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest On recent IBM Power CPUs, while the hashed page table is looked up using the page size from the segmentation hardware (i.e. the SLB), it is possible to have the HPT entry indicate a larger page size. Thus for example it is possible to put a 16MB page in a 64kB segment, but since the hash lookup is done using a 64kB page size, it may be necessary to put multiple entries in the HPT for a single 16MB page. This capability is called mixed page-size segment (MPSS). With MPSS, there are two relevant page sizes: the base page size, which is the size used in searching the HPT, and the actual page size, which is the size indicated in the HPT entry. [ Note that the actual page size is always >= base page size ]. We use "ibm,segment-page-sizes" device tree node to advertise the MPSS support to PAPR guest. The penc encoding indicates whether we support a specific combination of base page size and actual page size in the same segment. We also use the penc value in the LP encoding of HPTE entry. This patch exposes MPSS support to KVM guest by advertising the feature via "ibm,segment-page-sizes". It also adds the necessary changes to decode the base page size and the actual page size correctly from the HPTE entry. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
2e23f544 |
|
29-Apr-2014 |
Alexander Graf <agraf@suse.de> |
KVM: PPC: Book3S PR: Expose EBB registers POWER8 introduces a new facility called the "Event Based Branch" facility. It contains of a few registers that indicate where a guest should branch to when a defined event occurs and it's in PR mode. We don't want to really enable EBB as it will create a big mess with !PR guest mode while hardware is in PR and we don't really emulate the PMU anyway. So instead, let's just leave it at emulation of all its registers. Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
e14e7a1e |
|
21-Apr-2014 |
Alexander Graf <agraf@suse.de> |
KVM: PPC: Book3S PR: Expose TAR facility to guest POWER8 implements a new register called TAR. This register has to be enabled in FSCR and then from KVM's point of view is mere storage. This patch enables the guest to use TAR. Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
616dff86 |
|
29-Apr-2014 |
Alexander Graf <agraf@suse.de> |
KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR POWER8 introduced a new interrupt type called "Facility unavailable interrupt" which contains its status message in a new register called FSCR. Handle these exits and try to emulate instructions for unhandled facilities. Follow-on patches enable KVM to expose specific facilities into the guest. Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
5deb8e7a |
|
24-Apr-2014 |
Alexander Graf <agraf@suse.de> |
KVM: PPC: Make shared struct aka magic page guest endian The shared (magic) page is a data structure that contains often used supervisor privileged SPRs accessible via memory to the user to reduce the number of exits we have to take to read/write them. When we actually share this structure with the guest we have to maintain it in guest endianness, because some of the patch tricks only work with native endian load/store operations. Since we only share the structure with either host or guest in little endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv. For booke, the shared struct stays big endian. For book3s_64 hv we maintain the struct in host native endian, since it never gets shared with the guest. For book3s_64 pr we introduce a variable that tells us which endianness the shared struct is in and route every access to it through helper inline functions that evaluate this variable. Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
3102f784 |
|
23-May-2014 |
Michael Ellerman <mpe@ellerman.id.au> |
powerpc/kvm/book3s_hv: Use threads_per_subcore in KVM To support split core on POWER8 we need to modify various parts of the KVM code to use threads_per_subcore instead of threads_per_core. On systems that do not support split core threads_per_subcore == threads_per_core and these changes are a nop. We use threads_per_subcore as the value reported by KVM_CAP_PPC_SMT. This communicates to userspace that guests can only be created with a value of threads_per_core that is less than or equal to the current threads_per_subcore. This ensures that guests can only be created with a thread configuration that we are able to run given the current split core mode. Although threads_per_subcore can change during the life of the system, the commit that enables that will ensure that threads_per_subcore does not change during the life of a KVM VM. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Michael Neuling <mikey@neuling.org> Acked-by: Alexander Graf <agraf@suse.de> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
441c19c8 |
|
23-May-2014 |
Michael Ellerman <mpe@ellerman.id.au> |
powerpc/kvm/book3s_hv: Rework the secondary inhibit code As part of the support for split core on POWER8, we want to be able to block splitting of the core while KVM VMs are active. The logic to do that would be exactly the same as the code we currently have for inhibiting onlining of secondaries. Instead of adding an identical mechanism to block split core, rework the secondary inhibit code to be a "HV KVM is active" check. We can then use that in both the cpu hotplug code and the upcoming split core code. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Michael Neuling <mikey@neuling.org> Acked-by: Alexander Graf <agraf@suse.de> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
739e2425 |
|
24-Mar-2014 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Return ENODEV error rather than EIO If an attempt is made to load the kvm-hv module on a machine which doesn't have hypervisor mode available, return an ENODEV error, which is the conventional thing to return to indicate that this module is not applicable to the hardware of the current machine, rather than EIO, which causes a warning to be printed. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Scott Wood <scottwood@freescale.com>
|
#
a7d80d01 |
|
24-Mar-2014 |
Michael Neuling <mikey@neuling.org> |
KVM: PPC: Book3S HV: Add get/set_one_reg for new TM state This adds code to get/set_one_reg to read and write the new transactional memory (TM) state. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Scott Wood <scottwood@freescale.com>
|
#
7505258c |
|
24-Mar-2014 |
Anton Blanchard <anton@samba.org> |
KVM: PPC: Book3S HV: Fix KVM hang with CONFIG_KVM_XICS=n I noticed KVM is broken when KVM in-kernel XICS emulation (CONFIG_KVM_XICS) is disabled. The problem was introduced in 48eaef05 (KVM: PPC: Book3S HV: use xics_wake_cpu only when defined). It used CONFIG_KVM_XICS to wrap xics_wake_cpu, where CONFIG_PPC_ICP_NATIVE should have been used. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: stable@vger.kernel.org Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Scott Wood <scottwood@freescale.com>
|
#
e59d24e6 |
|
06-Feb-2014 |
Greg Kurz <groug@kaod.org> |
KVM: PPC: Book3S HV: Fix incorrect userspace exit on ioeventfd write When the guest does an MMIO write which is handled successfully by an ioeventfd, ioeventfd_write() returns 0 (success) and kvmppc_handle_store() returns EMULATE_DONE. Then kvmppc_emulate_mmio() converts EMULATE_DONE to RESUME_GUEST_NV and this causes an exit from the loop in kvmppc_vcpu_run_hv(), causing an exit back to userspace with a bogus exit reason code, typically causing userspace (e.g. qemu) to crash with a message about an unknown exit code. This adds handling of RESUME_GUEST_NV in kvmppc_vcpu_run_hv() in order to fix that. For generality, we define a helper to check for either of the return-to-guest codes we use, RESUME_GUEST and RESUME_GUEST_NV, to make it easy to check for either and provide one place to update if any other return-to-guest code gets defined in future. Since it only affects Book3S HV for now, the helper is added to the kvm_book3s.h header file. We use the helper in two places in kvmppc_run_core() as well for future-proofing, though we don't see RESUME_GUEST_NV in either place at present. [paulus@samba.org - combined 4 patches into one, rewrote description] Suggested-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
|
#
7b490411 |
|
08-Jan-2014 |
Michael Neuling <mikey@neuling.org> |
KVM: PPC: Book3S HV: Add new state for transactional memory Add new state for transactional memory (TM) to kvm_vcpu_arch. Also add asm-offset bits that are going to be required. This also moves the existing TFHAR, TFIAR and TEXASR SPRs into a CONFIG_PPC_TRANSACTIONAL_MEM section. This requires some code changes to ensure we still compile with CONFIG_PPC_TRANSACTIONAL_MEM=N. Much of the added the added #ifdefs are removed in a later patch when the bulk of the TM code is added. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@samba.org> [agraf: fix merge conflict] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
d682916a |
|
08-Jan-2014 |
Anton Blanchard <anton@samba.org> |
KVM: PPC: Book3S HV: Basic little-endian guest support We create a guest MSR from scratch when delivering exceptions in a few places. Instead of extracting LPCR[ILE] and inserting it into MSR_LE each time, we simply create a new variable intr_msr which contains the entire MSR to use. For a little-endian guest, userspace needs to set the ILE (interrupt little-endian) bit in the LPCR for each vcpu (or at least one vcpu in each virtual core). [paulus@samba.org - removed H_SET_MODE implementation from original version of the patch, and made kvmppc_set_lpcr update vcpu->arch.intr_msr.] Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
8563bf52 |
|
08-Jan-2014 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Add support for DABRX register on POWER7 The DABRX (DABR extension) register on POWER7 processors provides finer control over which accesses cause a data breakpoint interrupt. It contains 3 bits which indicate whether to enable accesses in user, kernel and hypervisor modes respectively to cause data breakpoint interrupts, plus one bit that enables both real mode and virtual mode accesses to cause interrupts. Currently, KVM sets DABRX to allow both kernel and user accesses to cause interrupts while in the guest. This adds support for the guest to specify other values for DABRX. PAPR defines a H_SET_XDABR hcall to allow the guest to set both DABR and DABRX with one call. This adds a real-mode implementation of H_SET_XDABR, which shares most of its code with the existing H_SET_DABR implementation. To support this, we add a per-vcpu field to store the DABRX value plus code to get and set it via the ONE_REG interface. For Linux guests to use this new hcall, userspace needs to add "hcall-xdabr" to the set of strings in the /chosen/hypertas-functions property in the device tree. If userspace does this and then migrates the guest to a host where the kernel doesn't include this patch, then userspace will need to implement H_SET_XDABR by writing the specified DABR value to the DABR using the ONE_REG interface. In that case, the old kernel will set DABRX to DABRX_USER | DABRX_KERNEL. That should still work correctly, at least for Linux guests, since Linux guests cope with getting data breakpoint interrupts in modes that weren't requested by just ignoring the interrupt, and Linux guests never set DABRX_BTI. The other thing this does is to make H_SET_DABR and H_SET_XDABR work on POWER8, which has the DAWR and DAWRX instead of DABR/X. Guests that know about POWER8 should use H_SET_MODE rather than H_SET_[X]DABR, but guests running in POWER7 compatibility mode will still use H_SET_[X]DABR. For them, this adds the logic to convert DABR/X values into DAWR/X values on POWER8. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
5d00f66b |
|
08-Jan-2014 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Prepare for host using hypervisor doorbells POWER8 has support for hypervisor doorbell interrupts. Though the kernel doesn't use them for IPIs on the powernv platform yet, it probably will in future, so this makes KVM cope gracefully if a hypervisor doorbell interrupt arrives while in a guest. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
e0622bd9 |
|
08-Jan-2014 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Handle new LPCR bits on POWER8 POWER8 has a bit in the LPCR to enable or disable the PURR and SPURR registers to count when in the guest. Set this bit. POWER8 has a field in the LPCR called AIL (Alternate Interrupt Location) which is used to enable relocation-on interrupts. Allow userspace to set this field. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
5557ae0e |
|
08-Jan-2014 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Implement architecture compatibility modes for POWER8 This allows us to select architecture 2.05 (POWER6) or 2.06 (POWER7) compatibility modes on a POWER8 processor. (Note that transactional memory is disabled for usermode if either or both of the PCR_TM_DIS and PCR_ARCH_206 bits are set.) Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
bd3048b8 |
|
08-Jan-2014 |
Michael Ellerman <michael@ellerman.id.au> |
KVM: PPC: Book3S HV: Add handler for HV facility unavailable At present this should never happen, since the host kernel sets HFSCR to allow access to all facilities. It's better to be prepared to handle it cleanly if it does ever happen, though. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
b005255e |
|
08-Jan-2014 |
Michael Neuling <mikey@neuling.org> |
KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs This adds fields to the struct kvm_vcpu_arch to store the new guest-accessible SPRs on POWER8, adds code to the get/set_one_reg functions to allow userspace to access this state, and adds code to the guest entry and exit to context-switch these SPRs between host and guest. Note that DPDES (Directed Privileged Doorbell Exception State) is shared between threads on a core; hence we store it in struct kvmppc_vcore and have the master thread save and restore it. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
e0b7ec05 |
|
08-Jan-2014 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Align physical and virtual CPU thread numbers On a threaded processor such as POWER7, we group VCPUs into virtual cores and arrange that the VCPUs in a virtual core run on the same physical core. Currently we don't enforce any correspondence between virtual thread numbers within a virtual core and physical thread numbers. Physical threads are allocated starting at 0 on a first-come first-served basis to runnable virtual threads (VCPUs). POWER8 implements a new "msgsndp" instruction which guest kernels can use to interrupt other threads in the same core or sub-core. Since the instruction takes the destination physical thread ID as a parameter, it becomes necessary to align the physical thread IDs with the virtual thread IDs, that is, to make sure virtual thread N within a virtual core always runs on physical thread N. This means that it's possible that thread 0, which is where we call __kvmppc_vcore_entry, may end up running some other vcpu than the one whose task called kvmppc_run_core(), or it may end up running no vcpu at all, if for example thread 0 of the virtual core is currently executing in userspace. However, we do need thread 0 to be responsible for switching the MMU -- a previous version of this patch that had other threads switching the MMU was found to be responsible for occasional memory corruption and machine check interrupts in the guest on POWER7 machines. To accommodate this, we no longer pass the vcpu pointer to __kvmppc_vcore_entry, but instead let the assembly code load it from the PACA. Since the assembly code will need to know the kvm pointer and the thread ID for threads which don't have a vcpu, we move the thread ID into the PACA and we add a kvm pointer to the virtual core structure. In the case where thread 0 has no vcpu to run, it still calls into kvmppc_hv_entry in order to do the MMU switch, and then naps until either its vcpu is ready to run in the guest, or some other thread needs to exit the guest. In the latter case, thread 0 jumps to the code that switches the MMU back to the host. This control flow means that now we switch the MMU before loading any guest vcpu state. Similarly, on guest exit we now save all the guest vcpu state before switching the MMU back to the host. This has required substantial code movement, making the diff rather large. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
48eaef05 |
|
30-Dec-2013 |
Andreas Schwab <schwab@linux-m68k.org> |
KVM: PPC: Book3S HV: use xics_wake_cpu only when defined Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> CC: stable@vger.kernel.org Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
efff1912 |
|
15-Oct-2013 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Store FP/VSX/VMX state in thread_fp/vr_state structures This uses struct thread_fp_state and struct thread_vr_state to store the floating-point, VMX/Altivec and VSX state, rather than flat arrays. This makes transferring the state to/from the thread_struct simpler and allows us to unify the get/set_one_reg implementations for the VSX registers. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
398a76c6 |
|
09-Dec-2013 |
Alexander Graf <agraf@suse.de> |
KVM: PPC: Add devname:kvm aliases for modules Systems that support automatic loading of kernel modules through device aliases should try and automatically load kvm when /dev/kvm gets opened. Add code to support that magic for all PPC kvm targets, even the ones that don't support modules yet. Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
c08ac06a |
|
12-Dec-2013 |
Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> |
KVM: Use cond_resched() directly and remove useless kvm_resched() Since the commit 15ad7146 ("KVM: Use the scheduler preemption notifiers to make kvm preemptible"), the remaining stuff in this function is a simple cond_resched() call with an extra need_resched() check which was there to avoid dropping VCPUs unnecessarily. Now it is meaningless. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
27025a60 |
|
18-Nov-2013 |
Liu Ping Fan <kernelfans@gmail.com> |
powerpc: kvm: optimize "sc 1" as fast return In some scene, e.g openstack CI, PR guest can trigger "sc 1" frequently, this patch optimizes the path by directly delivering BOOK3S_INTERRUPT_SYSCALL to HV guest, so powernv can return to HV guest without heavy exit, i.e, no need to swap TLB, HTAB,.. etc Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
c9438092 |
|
15-Nov-2013 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Take SRCU read lock around kvm_read_guest() call Running a kernel with CONFIG_PROVE_RCU=y yields the following diagnostic: =============================== [ INFO: suspicious RCU usage. ] 3.12.0-rc5-kvm+ #9 Not tainted ------------------------------- include/linux/kvm_host.h:473 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 1 lock held by qemu-system-ppc/4831: stack backtrace: CPU: 28 PID: 4831 Comm: qemu-system-ppc Not tainted 3.12.0-rc5-kvm+ #9 Call Trace: [c000000be462b2a0] [c00000000001644c] .show_stack+0x7c/0x1f0 (unreliable) [c000000be462b370] [c000000000ad57c0] .dump_stack+0x88/0xb4 [c000000be462b3f0] [c0000000001315e8] .lockdep_rcu_suspicious+0x138/0x180 [c000000be462b480] [c00000000007862c] .gfn_to_memslot+0x13c/0x170 [c000000be462b510] [c00000000007d384] .gfn_to_hva_prot+0x24/0x90 [c000000be462b5a0] [c00000000007d420] .kvm_read_guest_page+0x30/0xd0 [c000000be462b630] [c00000000007d528] .kvm_read_guest+0x68/0x110 [c000000be462b6e0] [c000000000084594] .kvmppc_rtas_hcall+0x34/0x180 [c000000be462b7d0] [c000000000097934] .kvmppc_pseries_do_hcall+0x74/0x830 [c000000be462b880] [c0000000000990e8] .kvmppc_vcpu_run_hv+0xff8/0x15a0 [c000000be462b9e0] [c0000000000839cc] .kvmppc_vcpu_run+0x2c/0x40 [c000000be462ba50] [c0000000000810b4] .kvm_arch_vcpu_ioctl_run+0x54/0x1b0 [c000000be462bae0] [c00000000007b508] .kvm_vcpu_ioctl+0x478/0x730 [c000000be462bca0] [c00000000025532c] .do_vfs_ioctl+0x4dc/0x7a0 [c000000be462bd80] [c0000000002556b4] .SyS_ioctl+0xc4/0xe0 [c000000be462be30] [c000000000009ee4] syscall_exit+0x0/0x98 To fix this, we take the SRCU read lock around the kvmppc_rtas_hcall() call. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
bf3d32e1 |
|
15-Nov-2013 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Make tbacct_lock irq-safe Lockdep reported that there is a potential for deadlock because vcpu->arch.tbacct_lock is not irq-safe, and is sometimes taken inside the rq_lock (run-queue lock) in the scheduler, which is taken within interrupts. The lockdep splat looks like: ====================================================== [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ] 3.12.0-rc5-kvm+ #8 Not tainted ------------------------------------------------------ qemu-system-ppc/4803 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: (&(&vcpu->arch.tbacct_lock)->rlock){+.+...}, at: [<c0000000000947ac>] .kvmppc_core_vcpu_put_hv+0x2c/0xa0 and this task is already holding: (&rq->lock){-.-.-.}, at: [<c000000000ac16c0>] .__schedule+0x180/0xaa0 which would create a new lock dependency: (&rq->lock){-.-.-.} -> (&(&vcpu->arch.tbacct_lock)->rlock){+.+...} but this new dependency connects a HARDIRQ-irq-safe lock: (&rq->lock){-.-.-.} ... which became HARDIRQ-irq-safe at: [<c00000000013797c>] .lock_acquire+0xbc/0x190 [<c000000000ac3c74>] ._raw_spin_lock+0x34/0x60 [<c0000000000f8564>] .scheduler_tick+0x54/0x180 [<c0000000000c2610>] .update_process_times+0x70/0xa0 [<c00000000012cdfc>] .tick_periodic+0x3c/0xe0 [<c00000000012cec8>] .tick_handle_periodic+0x28/0xb0 [<c00000000001ef40>] .timer_interrupt+0x120/0x2e0 [<c000000000002868>] decrementer_common+0x168/0x180 [<c0000000001c7ca4>] .get_page_from_freelist+0x924/0xc10 [<c0000000001c8e00>] .__alloc_pages_nodemask+0x200/0xba0 [<c0000000001c9eb8>] .alloc_pages_exact_nid+0x68/0x110 [<c000000000f4c3ec>] .page_cgroup_init+0x1e0/0x270 [<c000000000f24480>] .start_kernel+0x3e0/0x4e4 [<c000000000009d30>] .start_here_common+0x20/0x70 to a HARDIRQ-irq-unsafe lock: (&(&vcpu->arch.tbacct_lock)->rlock){+.+...} ... which became HARDIRQ-irq-unsafe at: ... [<c00000000013797c>] .lock_acquire+0xbc/0x190 [<c000000000ac3c74>] ._raw_spin_lock+0x34/0x60 [<c0000000000946ac>] .kvmppc_core_vcpu_load_hv+0x2c/0x100 [<c00000000008394c>] .kvmppc_core_vcpu_load+0x2c/0x40 [<c000000000081000>] .kvm_arch_vcpu_load+0x10/0x30 [<c00000000007afd4>] .vcpu_load+0x64/0xd0 [<c00000000007b0f8>] .kvm_vcpu_ioctl+0x68/0x730 [<c00000000025530c>] .do_vfs_ioctl+0x4dc/0x7a0 [<c000000000255694>] .SyS_ioctl+0xc4/0xe0 [<c000000000009ee4>] syscall_exit+0x0/0x98 Some users have reported this deadlock occurring in practice, though the reports have been primarily on 3.10.x-based kernels. This fixes the problem by making tbacct_lock be irq-safe. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
a78b55d1 |
|
07-Oct-2013 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
kvm: powerpc: book3s: drop is_hv_enabled drop is_hv_enabled, because that should not be a callback property Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
cbbc58d4 |
|
07-Oct-2013 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
kvm: powerpc: book3s: Allow the HV and PR selection per virtual machine This moves the kvmppc_ops callbacks to be a per VM entity. This enables us to select HV and PR mode when creating a VM. We also allow both kvm-hv and kvm-pr kernel module to be loaded. To achieve this we move /dev/kvm ownership to kvm.ko module. Depending on which KVM mode we select during VM creation we take a reference count on respective module Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [agraf: fix coding style] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
2ba9f0d8 |
|
07-Oct-2013 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
kvm: powerpc: book3s: Support building HV and PR KVM as module Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [agraf: squash in compile fix] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
699cc876 |
|
07-Oct-2013 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
kvm: powerpc: book3s: Add is_hv_enabled to kvmppc_ops This help us to identify whether we are running with hypervisor mode KVM enabled. The change is needed so that we can have both HV and PR kvm enabled in the same kernel. If both HV and PR KVM are included, interrupts come in to the HV version of the kvmppc_interrupt code, which then jumps to the PR handler, renamed to kvmppc_interrupt_pr, if the guest is a PR guest. Allowing both PR and HV in the same kernel required some changes to kvm_dev_ioctl_check_extension(), since the values returned now can't be selected with #ifdefs as much as previously. We look at is_hv_enabled to return the right value when checking for capabilities.For capabilities that are only provided by HV KVM, we return the HV value only if is_hv_enabled is true. For capabilities provided by PR KVM but not HV, we return the PR value only if is_hv_enabled is false. NOTE: in later patch we replace is_hv_enabled with a static inline function comparing kvm_ppc_ops Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
3a167bea |
|
07-Oct-2013 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
kvm: powerpc: Add kvmppc_ops callback This patch add a new callback kvmppc_ops. This will help us in enabling both HV and PR KVM together in the same kernel. The actual change to enable them together is done in the later patch in the series. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [agraf: squash in booke changes] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
f1378b1c |
|
27-Sep-2013 |
Paul Mackerras <paulus@samba.org> |
kvm: powerpc: book3s hv: Fix vcore leak add kvmppc_free_vcores() to free the kvmppc_vcore structures that we allocate for a guest, which are currently being leaked. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
f3271d4c |
|
19-Sep-2013 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Don't crash host on unknown guest interrupt If we come out of a guest with an interrupt that we don't know about, instead of crashing the host with a BUG(), we now return to userspace with the exit reason set to KVM_EXIT_UNKNOWN and the trap vector in the hw.hardware_exit_reason field of the kvm_run structure, as is done on x86. Note that run->exit_reason is already set to KVM_EXIT_UNKNOWN at the beginning of kvmppc_handle_exit(). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
388cc6e1 |
|
20-Sep-2013 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Support POWER6 compatibility mode on POWER7 This enables us to use the Processor Compatibility Register (PCR) on POWER7 to put the processor into architecture 2.05 compatibility mode when running a guest. In this mode the new instructions and registers that were introduced on POWER7 are disabled in user mode. This includes all the VSX facilities plus several other instructions such as ldbrx, stdbrx, popcntw, popcntd, etc. To select this mode, we have a new register accessible through the set/get_one_reg interface, called KVM_REG_PPC_ARCH_COMPAT. Setting this to zero gives the full set of capabilities of the processor. Setting it to one of the "logical" PVR values defined in PAPR puts the vcpu into the compatibility mode for the corresponding architecture level. The supported values are: 0x0f000002 Architecture 2.05 (POWER6) 0x0f000003 Architecture 2.06 (POWER7) 0x0f100003 Architecture 2.06+ (POWER7+) Since the PCR is per-core, the architecture compatibility level and the corresponding PCR value are stored in the struct kvmppc_vcore, and are therefore shared between all vcpus in a virtual core. Signed-off-by: Paul Mackerras <paulus@samba.org> [agraf: squash in fix to add missing break statements and documentation] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
4b8473c9 |
|
19-Sep-2013 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Add support for guest Program Priority Register POWER7 and later IBM server processors have a register called the Program Priority Register (PPR), which controls the priority of each hardware CPU SMT thread, and affects how fast it runs compared to other SMT threads. This priority can be controlled by writing to the PPR or by use of a set of instructions of the form or rN,rN,rN which are otherwise no-ops but have been defined to set the priority to particular levels. This adds code to context switch the PPR when entering and exiting guests and to make the PPR value accessible through the SET/GET_ONE_REG interface. When entering the guest, we set the PPR as late as possible, because if we are setting a low thread priority it will make the code run slowly from that point on. Similarly, the first-level interrupt handlers save the PPR value in the PACA very early on, and set the thread priority to the medium level, so that the interrupt handling code runs at a reasonable speed. Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
a0144e2a |
|
19-Sep-2013 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Store LPCR value for each virtual core This adds the ability to have a separate LPCR (Logical Partitioning Control Register) value relating to a guest for each virtual core, rather than only having a single value for the whole VM. This corresponds to what real POWER hardware does, where there is a LPCR per CPU thread but most of the fields are required to have the same value on all active threads in a core. The per-virtual-core LPCR can be read and written using the GET/SET_ONE_REG interface. Userspace can can only modify the following fields of the LPCR value: DPFD Default prefetch depth ILE Interrupt little-endian TC Translation control (secondary HPT hash group search disable) We still maintain a per-VM default LPCR value in kvm->arch.lpcr, which contains bits relating to memory management, i.e. the Virtualized Partition Memory (VPM) bits and the bits relating to guest real mode. When this default value is updated, the update needs to be propagated to the per-vcore values, so we add a kvmppc_update_lpcr() helper to do that. Signed-off-by: Paul Mackerras <paulus@samba.org> [agraf: fix whitespace] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
42d7604d |
|
05-Sep-2013 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Implement H_CONFER The H_CONFER hypercall is used when a guest vcpu is spinning on a lock held by another vcpu which has been preempted, and the spinning vcpu wishes to give its timeslice to the lock holder. We implement this in the straightforward way using kvm_vcpu_yield_to(). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
93b0f4dc |
|
05-Sep-2013 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Implement timebase offset for guests This allows guests to have a different timebase origin from the host. This is needed for migration, where a guest can migrate from one host to another and the two hosts might have a different timebase origin. However, the timebase seen by the guest must not go backwards, and should go forwards only by a small amount corresponding to the time taken for the migration. Therefore this provides a new per-vcpu value accessed via the one_reg interface using the new KVM_REG_PPC_TB_OFFSET identifier. This value defaults to 0 and is not modified by KVM. On entering the guest, this value is added onto the timebase, and on exiting the guest, it is subtracted from the timebase. This is only supported for recent POWER hardware which has the TBU40 (timebase upper 40 bits) register. Writing to the TBU40 register only alters the upper 40 bits of the timebase, leaving the lower 24 bits unchanged. This provides a way to modify the timebase for guest migration without disturbing the synchronization of the timebase registers across CPU cores. The kernel rounds up the value given to a multiple of 2^24. Timebase values stored in KVM structures (struct kvm_vcpu, struct kvmppc_vcore, etc.) are stored as host timebase values. The timebase values in the dispatch trace log need to be guest timebase values, however, since that is read directly by the guest. This moves the setting of vcpu->arch.dec_expires on guest exit to a point after we have restored the host timebase so that vcpu->arch.dec_expires is a host timebase value. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
14941789 |
|
05-Sep-2013 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Save/restore SIAR and SDAR along with other PMU registers Currently we are not saving and restoring the SIAR and SDAR registers in the PMU (performance monitor unit) on guest entry and exit. The result is that performance monitoring tools in the guest could get false information about where a program was executing and what data it was accessing at the time of a performance monitor interrupt. This fixes it by saving and restoring these registers along with the other PMU registers on guest entry/exit. This also provides a way for userspace to access these values for a vcpu via the one_reg interface. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
5d226ae5 |
|
22-Jul-2013 |
Chen Gang <gang.chen@asianux.com> |
arch: powerpc: kvm: add signed type cast for comparation 'rmls' is 'unsigned long', lpcr_rmls() will return negative number when failure occurs, so it need a type cast for comparing. 'lpid' is 'unsigned long', kvmppc_alloc_lpid() return negative number when failure occurs, so it need a type cast for comparing. Signed-off-by: Chen Gang <gang.chen@asianux.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
2f84d5ea |
|
24-Aug-2013 |
Yann Droneaud <ydroneaud@opteya.com> |
ppc: kvm: use anon_inode_getfd() with O_CLOEXEC flag KVM uses anon_inode_get() to allocate file descriptors as part of some of its ioctls. But those ioctls are lacking a flag argument allowing userspace to choose options for the newly opened file descriptor. In such case it's advised to use O_CLOEXEC by default so that userspace is allowed to choose, without race, if the file descriptor is going to be inherited across exec(). This patch set O_CLOEXEC flag on all file descriptors created with anon_inode_getfd() to not leak file descriptors across exec(). Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Link: http://lkml.kernel.org/r/cover.1377372576.git.ydroneaud@opteya.com Reviewed-by: Alexander Graf <agraf@suse.de> Signed-off-by: Gleb Natapov <gleb@redhat.com>
|
#
87916442 |
|
22-Aug-2013 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
powerpc/kvm: Copy the pvr value after memset Otherwise we would clear the pvr value Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
f13c13a0 |
|
06-Aug-2013 |
Anton Blanchard <anton@samba.org> |
powerpc: Stop using non-architected shared_proc field in lppaca Although the shared_proc field in the lppaca works today, it is not architected. A shared processor partition will always have a non zero yield_count so use that instead. Create a wrapper so users don't have to know about the details. In order for older kernels to continue to work on KVM we need to set the shared_proc bit. While here, remove the ugly bitfield. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
2fb10672 |
|
22-Jul-2013 |
Chen Gang <gang.chen@asianux.com> |
powerpc/kvm: Add signed type cast for comparation 'rmls' is 'unsigned long', lpcr_rmls() will return negative number when failure occurs, so it need a type cast for comparing. 'lpid' is 'unsigned long', kvmppc_alloc_lpid() return negative number when failure occurs, so it need a type cast for comparing. Signed-off-by: Chen Gang <gang.chen@asianux.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
6c45b810 |
|
01-Jul-2013 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
powerpc/kvm: Contiguous memory allocator based RMA allocation Older version of power architecture use Real Mode Offset register and Real Mode Limit Selector for mapping guest Real Mode Area. The guest RMA should be physically contigous since we use the range when address translation is not enabled. This patch switch RMA allocation code to use contigous memory allocator. The patch also remove the the linear allocator which not used any more Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
505d6421 |
|
15-Mar-2013 |
Lai Jiangshan <laijs@cn.fujitsu.com> |
powerpc,kvm: fix imbalance srcu_read_[un]lock() At the point of up_out label in kvmppc_hv_setup_htab_rma(), srcu read lock is still held. We have to release it before return. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Alexander Graf <agraf@suse.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: kvm@vger.kernel.org Cc: kvm-ppc@vger.kernel.org Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
|
#
8e44ddc3 |
|
23-May-2013 |
Paul Mackerras <paulus@samba.org> |
powerpc/kvm/book3s: Add support for H_IPOLL and H_XIRR_X in XICS emulation This adds the remaining two hypercalls defined by PAPR for manipulating the XICS interrupt controller, H_IPOLL and H_XIRR_X. H_IPOLL returns information about the priority and pending interrupts for a virtual cpu, without changing any state. H_XIRR_X is like H_XIRR in that it reads and acknowledges the highest-priority pending interrupt, but it also returns the timestamp (timebase register value) from when the interrupt was first received by the hypervisor. Currently we just return the current time, since we don't do any software queueing of virtual interrupts inside the XICS emulation code. These hcalls are not currently used by Linux guests, but may be in future. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
b1022fbd |
|
28-Apr-2013 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
powerpc: Decode the pte-lp-encoding bits correctly. We look at both the segment base page size and actual page size and store the pte-lp-encodings in an array per base page size. We also update all relevant functions to take actual page size argument so that we can use the correct PTE LP encoding in HPTE. This should also get the basic Multiple Page Size per Segment (MPSS) support. This is needed to enable THP on ppc64. [Fixed PR KVM build --BenH] Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
4619ac88 |
|
17-Apr-2013 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Improve real-mode handling of external interrupts This streamlines our handling of external interrupts that come in while we're in the guest. First, when waking up a hardware thread that was napping, we split off the "napping due to H_CEDE" case earlier, and use the code that handles an external interrupt (0x500) in the guest to handle that too. Secondly, the code that handles those external interrupts now checks if any other thread is exiting to the host before bouncing an external interrupt to the guest, and also checks that there is actually an external interrupt pending for the guest before setting the LPCR MER bit (mediated external request). This also makes sure that we clear the "ceded" flag when we handle a wakeup from cede in real mode, and fixes a potential infinite loop in kvmppc_run_vcpu() which can occur if we ever end up with the ceded flag set but MSR[EE] off. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
54695c30 |
|
17-Apr-2013 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
KVM: PPC: Book3S HV: Speed up wakeups of CPUs on HV KVM Currently, we wake up a CPU by sending a host IPI with smp_send_reschedule() to thread 0 of that core, which will take all threads out of the guest, and cause them to re-evaluate their interrupt status on the way back in. This adds a mechanism to differentiate real host IPIs from IPIs sent by KVM for guest threads to poke each other, in order to target the guest threads precisely when possible and avoid that global switch of the core to host state. We then use this new facility in the in-kernel XICS code. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
bc5ad3f3 |
|
17-Apr-2013 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Each vcpu gets one ICP (interrupt controller presentation) entity, used to store the per-vcpu state such as vcpu priority, pending interrupt state, IPI request, etc. This does not include any API or any way to connect vcpus to their ICP state; that will be added in later patches. This is based on an initial implementation by Michael Ellerman <michael@ellerman.id.au> reworked by Benjamin Herrenschmidt and Paul Mackerras. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org> [agraf: fix typo, add dependency on !KVM_MPIC] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
8e591cb7 |
|
17-Apr-2013 |
Michael Ellerman <michael@ellerman.id.au> |
KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls For pseries machine emulation, in order to move the interrupt controller code to the kernel, we need to intercept some RTAS calls in the kernel itself. This adds an infrastructure to allow in-kernel handlers to be registered for RTAS services by name. A new ioctl, KVM_PPC_RTAS_DEFINE_TOKEN, then allows userspace to associate token values with those service names. Then, when the guest requests an RTAS service with one of those token values, it will be handled by the relevant in-kernel handler rather than being passed up to userspace as at present. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org> [agraf: fix warning] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
c35635ef |
|
18-Apr-2013 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Report VPA and DTL modifications in dirty map At present, the KVM_GET_DIRTY_LOG ioctl doesn't report modifications done by the host to the virtual processor areas (VPAs) and dispatch trace logs (DTLs) registered by the guest. This is because those modifications are done either in real mode or in the host kernel context, and in neither case does the access go through the guest's HPT, and thus no change (C) bit gets set in the guest's HPT. However, the changes done by the host do need to be tracked so that the modified pages get transferred when doing live migration. In order to track these modifications, this adds a dirty flag to the struct representing the VPA/DTL areas, and arranges to set the flag when the VPA/DTL gets modified by the host. Then, when we are collecting the dirty log, we also check the dirty flags for the VPA and DTL for each vcpu and set the relevant bit in the dirty log if necessary. Doing this also means we now need to keep track of the guest physical address of the VPA/DTL areas. So as not to lose track of modifications to a VPA/DTL area when it gets unregistered, or when a new area gets registered in its place, we need to transfer the dirty state to the rmap chain. This adds code to kvmppc_unpin_guest_page() to do that if the area was dirty. To simplify that code, we now require that all VPA, DTL and SLB shadow buffer areas fit within a single host page. Guests already comply with this requirement because pHyp requires that these areas not cross a 4k boundary. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
75ef9de1 |
|
04-Apr-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
constify a bunch of struct file_operations instances Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8482644a |
|
27-Feb-2013 |
Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> |
KVM: set_memory_region: Refactor commit_memory_region() This patch makes the parameter old a const pointer to the old memory slot and adds a new parameter named change to know the change being requested: the former is for removing extra copying and the latter is for cleaning up the code. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
#
bbacc0c1 |
|
10-Dec-2012 |
Alex Williamson <alex.williamson@redhat.com> |
KVM: Rename KVM_MEMORY_SLOTS -> KVM_USER_MEM_SLOTS It's easy to confuse KVM_MEMORY_SLOTS and KVM_MEM_SLOTS_NUM. One is the user accessible slots and the other is user + private. Make this more obvious. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
#
b4072df4 |
|
23-Nov-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking Currently, if a machine check interrupt happens while we are in the guest, we exit the guest and call the host's machine check handler, which tends to cause the host to panic. Some machine checks can be triggered by the guest; for example, if the guest creates two entries in the SLB that map the same effective address, and then accesses that effective address, the CPU will take a machine check interrupt. To handle this better, when a machine check happens inside the guest, we call a new function, kvmppc_realmode_machine_check(), while still in real mode before exiting the guest. On POWER7, it handles the cases that the guest can trigger, either by flushing and reloading the SLB, or by flushing the TLB, and then it delivers the machine check interrupt directly to the guest without going back to the host. On POWER7, the OPAL firmware patches the machine check interrupt vector so that it gets control first, and it leaves behind its analysis of the situation in a structure pointed to by the opal_mc_evt field of the paca. The kvmppc_realmode_machine_check() function looks at this, and if OPAL reports that there was no error, or that it has handled the error, we also go straight back to the guest with a machine check. We have to deliver a machine check to the guest since the machine check interrupt might have trashed valid values in SRR0/1. If the machine check is one we can't handle in real mode, and one that OPAL hasn't already handled, or on PPC970, we exit the guest and call the host's machine check handler. We do this by jumping to the machine_check_fwnmi label, rather than absolute address 0x200, because we don't want to re-execute OPAL's handler on POWER7. On PPC970, the two are equivalent because address 0x200 just contains a branch. Then, if the host machine check handler decides that the system can continue executing, kvmppc_handle_exit() delivers a machine check interrupt to the guest -- once again to let the guest know that SRR0/1 have been modified. Signed-off-by: Paul Mackerras <paulus@samba.org> [agraf: fix checkpatch warnings] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
1b400ba0 |
|
21-Nov-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations When we change or remove a HPT (hashed page table) entry, we can do either a global TLB invalidation (tlbie) that works across the whole machine, or a local invalidation (tlbiel) that only affects this core. Currently we do local invalidations if the VM has only one vcpu or if the guest requests it with the H_LOCAL flag, though the guest Linux kernel currently doesn't ever use H_LOCAL. Then, to cope with the possibility that vcpus moving around to different physical cores might expose stale TLB entries, there is some code in kvmppc_hv_entry to flush the whole TLB of entries for this VM if either this vcpu is now running on a different physical core from where it last ran, or if this physical core last ran a different vcpu. There are a number of problems on POWER7 with this as it stands: - The TLB invalidation is done per thread, whereas it only needs to be done per core, since the TLB is shared between the threads. - With the possibility of the host paging out guest pages, the use of H_LOCAL by an SMP guest is dangerous since the guest could possibly retain and use a stale TLB entry pointing to a page that had been removed from the guest. - The TLB invalidations that we do when a vcpu moves from one physical core to another are unnecessary in the case of an SMP guest that isn't using H_LOCAL. - The optimization of using local invalidations rather than global should apply to guests with one virtual core, not just one vcpu. (None of this applies on PPC970, since there we always have to invalidate the whole TLB when entering and leaving the guest, and we can't support paging out guest memory.) To fix these problems and simplify the code, we now maintain a simple cpumask of which cpus need to flush the TLB on entry to the guest. (This is indexed by cpu, though we only ever use the bits for thread 0 of each core.) Whenever we do a local TLB invalidation, we set the bits for every cpu except the bit for thread 0 of the core that we're currently running on. Whenever we enter a guest, we test and clear the bit for our core, and flush the TLB if it was set. On initial startup of the VM, and when resetting the HPT, we set all the bits in the need_tlb_flush cpumask, since any core could potentially have stale TLB entries from the previous VM to use the same LPID, or the previous contents of the HPT. Then, we maintain a count of the number of online virtual cores, and use that when deciding whether to use a local invalidation rather than the number of online vcpus. The code to make that decision is extracted out into a new function, global_invalidates(). For multi-core guests on POWER7 (i.e. when we are using mmu notifiers), we now never do local invalidations regardless of the H_LOCAL flag. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
a2932923 |
|
19-Nov-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor. Reads on this fd return the contents of the HPT (hashed page table), writes create and/or remove entries in the HPT. There is a new capability, KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl. The ioctl takes an argument structure with the index of the first HPT entry to read out and a set of flags. The flags indicate whether the user is intending to read or write the HPT, and whether to return all entries or only the "bolted" entries (those with the bolted bit, 0x10, set in the first doubleword). This is intended for use in implementing qemu's savevm/loadvm and for live migration. Therefore, on reads, the first pass returns information about all HPTEs (or all bolted HPTEs). When the first pass reaches the end of the HPT, it returns from the read. Subsequent reads only return information about HPTEs that have changed since they were last read. A read that finds no changed HPTEs in the HPT following where the last read finished will return 0 bytes. The format of the data provides a simple run-length compression of the invalid entries. Each block of data starts with a header that indicates the index (position in the HPT, which is just an array), the number of valid entries starting at that index (may be zero), and the number of invalid entries following those valid entries. The valid entries, 16 bytes each, follow the header. The invalid entries are not explicitly represented. Signed-off-by: Paul Mackerras <paulus@samba.org> [agraf: fix documentation] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
9f8c8c78 |
|
14-Oct-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Allow DTL to be set to address 0, length 0 Commit 55b665b026 ("KVM: PPC: Book3S HV: Provide a way for userspace to get/set per-vCPU areas") includes a check on the length of the dispatch trace log (DTL) to make sure the buffer is at least one entry long. This is appropriate when registering a buffer, but the interface also allows for any existing buffer to be unregistered by specifying a zero address. In this case the length check is not appropriate. This makes the check conditional on the address being non-zero. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
c7b67670 |
|
14-Oct-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Fix accounting of stolen time Currently the code that accounts stolen time tends to overestimate the stolen time, and will sometimes report more stolen time in a DTL (dispatch trace log) entry than has elapsed since the last DTL entry. This can cause guests to underflow the user or system time measured for some tasks, leading to ridiculous CPU percentages and total runtimes being reported by top and other utilities. In addition, the current code was designed for the previous policy where a vcore would only run when all the vcpus in it were runnable, and so only counted stolen time on a per-vcore basis. Now that a vcore can run while some of the vcpus in it are doing other things in the kernel (e.g. handling a page fault), we need to count the time when a vcpu task is preempted while it is not running as part of a vcore as stolen also. To do this, we bring back the BUSY_IN_HOST vcpu state and extend the vcpu_load/put functions to count preemption time while the vcpu is in that state. Handling the transitions between the RUNNING and BUSY_IN_HOST states requires checking and updating two variables (accumulated time stolen and time last preempted), so we add a new spinlock, vcpu->arch.tbacct_lock. This protects both the per-vcpu stolen/preempt-time variables, and the per-vcore variables while this vcpu is running the vcore. Finally, we now don't count time spent in userspace as stolen time. The task could be executing in userspace on behalf of the vcpu, or it could be preempted, or the vcpu could be genuinely stopped. Since we have no way of dividing up the time between these cases, we don't count any of it as stolen. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
8455d79e |
|
14-Oct-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Run virtual core whenever any vcpus in it can run Currently the Book3S HV code implements a policy on multi-threaded processors (i.e. POWER7) that requires all of the active vcpus in a virtual core to be ready to run before we run the virtual core. However, that causes problems on reset, because reset stops all vcpus except vcpu 0, and can also reduce throughput since all four threads in a virtual core have to wait whenever any one of them hits a hypervisor page fault. This relaxes the policy, allowing the virtual core to run as soon as any vcpu in it is runnable. With this, the KVMPPC_VCPU_STOPPED state and the KVMPPC_VCPU_BUSY_IN_HOST state have been combined into a single KVMPPC_VCPU_NOTREADY state, since we no longer need to distinguish between them. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
2f12f034 |
|
14-Oct-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Fixes for late-joining threads If a thread in a virtual core becomes runnable while other threads in the same virtual core are already running in the guest, it is possible for the latecomer to join the others on the core without first pulling them all out of the guest. Currently this only happens rarely, when a vcpu is first started. This fixes some bugs and omissions in the code in this case. First, we need to check for VPA updates for the latecomer and make a DTL entry for it. Secondly, if it comes along while the master vcpu is doing a VPA update, we don't need to do anything since the master will pick it up in kvmppc_run_core. To handle this correctly we introduce a new vcore state, VCORE_STARTING. Thirdly, there is a race because we currently clear the hardware thread's hwthread_req before waiting to see it get to nap. A latecomer thread could have its hwthread_req cleared before it gets to test it, and therefore never increment the nap_count, leading to messages about wait_for_nap timeouts. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
913d3ff9a |
|
14-Oct-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3s HV: Don't access runnable threads list without vcore lock There were a few places where we were traversing the list of runnable threads in a virtual core, i.e. vc->runnable_threads, without holding the vcore spinlock. This extends the places where we hold the vcore spinlock to cover everywhere that we traverse that list. Since we possibly need to sleep inside kvmppc_book3s_hv_page_fault, this moves the call of it from kvmppc_handle_exit out to kvmppc_vcpu_run, where we don't hold the vcore lock. In kvmppc_vcore_blocked, we don't actually need to check whether all vcpus are ceded and don't have any pending exceptions, since the caller has already done that. The caller (kvmppc_run_vcpu) wasn't actually checking for pending exceptions, so we add that. The change of if to while in kvmppc_run_vcpu is to make sure that we never call kvmppc_remove_runnable() when the vcore state is RUNNING or EXITING. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
7b444c67 |
|
14-Oct-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Fix some races in starting secondary threads Subsequent patches implementing in-kernel XICS emulation will make it possible for IPIs to arrive at secondary threads at arbitrary times. This fixes some races in how we start the secondary threads, which if not fixed could lead to occasional crashes of the host kernel. This makes sure that (a) we have grabbed all the secondary threads, and verified that they are no longer in the kernel, before we start any thread, (b) that the secondary thread loads its vcpu pointer after clearing the IPI that woke it up (so we don't miss a wakeup), and (c) that the secondary thread clears its vcpu pointer before incrementing the nap count. It also removes unnecessary setting of the vcpu and vcore pointers in the paca in kvmppc_core_vcpu_load. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
512691d4 |
|
14-Oct-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Allow KVM guests to stop secondary threads coming online When a Book3S HV KVM guest is running, we need the host to be in single-thread mode, that is, all of the cores (or at least all of the cores where the KVM guest could run) to be running only one active hardware thread. This is because of the hardware restriction in POWER processors that all of the hardware threads in the core must be in the same logical partition. Complying with this restriction is much easier if, from the host kernel's point of view, only one hardware thread is active. This adds two hooks in the SMP hotplug code to allow the KVM code to make sure that secondary threads (i.e. hardware threads other than thread 0) cannot come online while any KVM guest exists. The KVM code still has to check that any core where it runs a guest has the secondary threads offline, but having done that check it can now be sure that they will not come online while the guest is running. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
314e51b9 |
|
08-Oct-2012 |
Konstantin Khlebnikov <khlebnikov@openvz.org> |
mm: kill vma flag VM_RESERVED and mm->reserved_vm counter A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA, currently it lost original meaning but still has some effects: | effect | alternative flags -+------------------------+--------------------------------------------- 1| account as reserved_vm | VM_IO 2| skip in core dump | VM_IO, VM_DONTDUMP 3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP 4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP This patch removes reserved_vm counter from mm_struct. Seems like nobody cares about it, it does not exported into userspace directly, it only reduces total_vm showed in proc. Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP. remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP. remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP. [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup] Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
55b665b0 |
|
25-Sep-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Provide a way for userspace to get/set per-vCPU areas The PAPR paravirtualization interface lets guests register three different types of per-vCPU buffer areas in its memory for communication with the hypervisor. These are called virtual processor areas (VPAs). Currently the hypercalls to register and unregister VPAs are handled by KVM in the kernel, and userspace has no way to know about or save and restore these registrations across a migration. This adds "register" codes for these three areas that userspace can use with the KVM_GET/SET_ONE_REG ioctls to see what addresses have been registered, and to register or unregister them. This will be needed for guest hibernation and migration, and is also needed so that userspace can unregister them on reset (otherwise we corrupt guest memory after reboot by writing to the VPAs registered by the previous kernel). The "register" for the VPA is a 64-bit value containing the address, since the length of the VPA is fixed. The "registers" for the SLB shadow buffer and dispatch trace log (DTL) are 128 bits long, consisting of the guest physical address in the high (first) 64 bits and the length in the low 64 bits. This also fixes a bug where we were calling init_vpa unconditionally, leading to an oops when unregistering the VPA. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
a8bd19ef |
|
25-Sep-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S: Get/set guest FP regs using the GET/SET_ONE_REG interface This enables userspace to get and set all the guest floating-point state using the KVM_[GS]ET_ONE_REG ioctls. The floating-point state includes all of the traditional floating-point registers and the FPSCR (floating point status/control register), all the VMX/Altivec vector registers and the VSCR (vector status/control register), and on POWER7, the vector-scalar registers (note that each FP register is the high-order half of the corresponding VSR). Most of these are implemented in common Book 3S code, except for VSX on POWER7. Because HV and PR differ in how they store the FP and VSX registers on POWER7, the code for these cases is not common. On POWER7, the FP registers are the upper halves of the VSX registers vsr0 - vsr31. PR KVM stores vsr0 - vsr31 in two halves, with the upper halves in the arch.fpr[] array and the lower halves in the arch.vsr[] array, whereas HV KVM on POWER7 stores the whole VSX register in arch.vsr[]. Signed-off-by: Paul Mackerras <paulus@samba.org> [agraf: fix whitespace, vsx compilation] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
a136a8bd |
|
25-Sep-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S: Get/set guest SPRs using the GET/SET_ONE_REG interface This enables userspace to get and set various SPRs (special-purpose registers) using the KVM_[GS]ET_ONE_REG ioctls. With this, userspace can get and set all the SPRs that are part of the guest state, either through the KVM_[GS]ET_REGS ioctls, the KVM_[GS]ET_SREGS ioctls, or the KVM_[GS]ET_ONE_REG ioctls. The SPRs that are added here are: - DABR: Data address breakpoint register - DSCR: Data stream control register - PURR: Processor utilization of resources register - SPURR: Scaled PURR - DAR: Data address register - DSISR: Data storage interrupt status register - AMR: Authority mask register - UAMOR: User authority mask override register - MMCR0, MMCR1, MMCRA: Performance monitor unit control registers - PMC1..PMC8: Performance monitor unit counter registers In order to reduce code duplication between PR and HV KVM code, this moves the kvm_vcpu_ioctl_[gs]et_one_reg functions into book3s.c and centralizes the copying between user and kernel space there. The registers that are handled differently between PR and HV, and those that exist only in one flavor, are handled in kvmppc_[gs]et_one_reg() functions that are specific to each flavor. Signed-off-by: Paul Mackerras <paulus@samba.org> [agraf: minimal style fixes] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
964ee98c |
|
20-Sep-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Remove bogus update of physical thread IDs When making a vcpu non-runnable we incorrectly changed the thread IDs of all other threads on the core, just remove that code. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
dfe49dbd |
|
11-Sep-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Handle memory slot deletion and modification correctly This adds an implementation of kvm_arch_flush_shadow_memslot for Book3S HV, and arranges for kvmppc_core_commit_memory_region to flush the dirty log when modifying an existing slot. With this, we can handle deletion and modification of memory slots. kvm_arch_flush_shadow_memslot calls kvmppc_core_flush_memslot, which on Book3S HV now traverses the reverse map chains to remove any HPT (hashed page table) entries referring to pages in the memslot. This gets called by generic code whenever deleting a memslot or changing the guest physical address for a memslot. We flush the dirty log in kvmppc_core_commit_memory_region for consistency with what x86 does. We only need to flush when an existing memslot is being modified, because for a new memslot the rmap array (which stores the dirty bits) is all zero, meaning that every page is considered clean already, and when deleting a memslot we obviously don't care about the dirty bits any more. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
a66b48c3 |
|
11-Sep-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Move kvm->arch.slot_phys into memslot.arch Now that we have an architecture-specific field in the kvm_memory_slot structure, we can use it to store the array of page physical addresses that we need for Book3S HV KVM on PPC970 processors. This reduces the size of struct kvm_arch for Book3S HV, and also reduces the size of struct kvm_arch_memory_slot for other PPC KVM variants since the fields in it are now only compiled in for Book3S HV. This necessitates making the kvm_arch_create_memslot and kvm_arch_free_memslot operations specific to each PPC KVM variant. That in turn means that we now don't allocate the rmap arrays on Book3S PR and Book E. Since we now unpin pages and free the slot_phys array in kvmppc_core_free_memslot, we no longer need to do it in kvmppc_core_destroy_vm, since the generic code takes care to free all the memslots when destroying a VM. We now need the new memslot to be passed in to kvmppc_core_prepare_memory_region, since we need to initialize its arch.slot_phys member on Book3S HV. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
2c9097e4 |
|
11-Sep-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Take the SRCU read lock before looking up memslots The generic KVM code uses SRCU (sleeping RCU) to protect accesses to the memslots data structures against updates due to userspace adding, modifying or removing memory slots. We need to do that too, both to avoid accessing stale copies of the memslots and to avoid lockdep warnings. This therefore adds srcu_read_lock/unlock pairs around code that accesses and uses memslots. Since the real-mode handlers for H_ENTER, H_REMOVE and H_BULK_REMOVE need to access the memslots, and we don't want to call the SRCU code in real mode (since we have no assurance that it would only access the linear mapping), we hold the SRCU read lock for the VM while in the guest. This does mean that adding or removing memory slots while some vcpus are executing in the guest will block for up to two jiffies. This tradeoff is acceptable since adding/removing memory slots only happens rarely, while H_ENTER/H_REMOVE/H_BULK_REMOVE are performance-critical hot paths. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
081f323b |
|
01-Jun-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Drop locks around call to kvmppc_pin_guest_page At the moment we call kvmppc_pin_guest_page() in kvmppc_update_vpa() with two spinlocks held: the vcore lock and the vcpu->vpa_update_lock. This is not good, since kvmppc_pin_guest_page() calls down_read() and get_user_pages_fast(), both of which can sleep. This bug was introduced in 2e25aa5f ("KVM: PPC: Book3S HV: Make virtual processor area registration more robust"). This arranges to drop those spinlocks before calling kvmppc_pin_guest_page() and re-take them afterwards. Dropping the vcore lock in kvmppc_run_core() means we have to set the vcore_state field to VCORE_RUNNING before we drop the lock, so that other vcpus won't try to run this vcore. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
32fad281 |
|
03-May-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Make the guest hash table size configurable This adds a new ioctl to enable userspace to control the size of the guest hashed page table (HPT) and to clear it out when resetting the guest. The KVM_PPC_ALLOCATE_HTAB ioctl is a VM ioctl and takes as its parameter a pointer to a u32 containing the desired order of the HPT (log base 2 of the size in bytes), which is updated on successful return to the actual order of the HPT which was allocated. There must be no vcpus running at the time of this ioctl. To enforce this, we now keep a count of the number of vcpus running in kvm->arch.vcpus_running. If the ioctl is called when a HPT has already been allocated, we don't reallocate the HPT but just clear it out. We first clear the kvm->arch.rma_setup_done flag, which has two effects: (a) since we hold the kvm->lock mutex, it will prevent any vcpus from starting to run until we're done, and (b) it means that the first vcpu to run after we're done will re-establish the VRMA if necessary. If userspace doesn't call this ioctl before running the first vcpu, the kernel will allocate a default-sized HPT at that point. We do it then rather than when creating the VM, as the code did previously, so that userspace has a chance to do the ioctl if it wants. When allocating the HPT, we can allocate either from the kernel page allocator, or from the preallocated pool. If userspace is asking for a different size from the preallocated HPTs, we first try to allocate using the kernel page allocator. Then we try to allocate from the preallocated pool, and then if that fails, we try allocating decreasing sizes from the kernel page allocator, down to the minimum size allowed (256kB). Note that the kernel page allocator limits allocations to 1 << CONFIG_FORCE_MAX_ZONEORDER pages, which by default corresponds to 16MB (on 64-bit powerpc, at least). Signed-off-by: Paul Mackerras <paulus@samba.org> [agraf: fix module compilation] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
de6c0b02 |
|
08-May-2012 |
David Gibson <david@gibson.dropbear.id.au> |
KVM: PPC: Book3S HV: Fix refcounting of hugepages The H_REGISTER_VPA hcall implementation in HV Power KVM needs to pin some guest memory pages into host memory so that they can be safely accessed from usermode. It does this used get_user_pages_fast(). When the VPA is unregistered, or the VCPUs are cleaned up, these pages are released using put_page(). However, the get_user_pages() is invoked on the specific memory are of the VPA which could lie within hugepages. In case the pinned page is huge, we explicitly find the head page of the compound page before calling put_page() on it. At least with the latest kernel, this is not correct. put_page() already handles finding the correct head page of a compound, and also deals with various counts on the individual tail page which are important for transparent huge pages. We don't support transparent hugepages on Power, but even so, bypassing this count maintenance can lead (when the VM ends) to a hugepage being released back to the pool with a non-zero mapcount on one of the tail pages. This can then lead to a bad_page() when the page is released from the hugepage pool. This removes the explicit compound_head() call to correct this bug. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
54771e62 |
|
04-May-2012 |
Alexander Graf <agraf@suse.de> |
KVM: PPC: Emulator: clean up SPR reads and writes When reading and writing SPRs, every SPR emulation piece had to read or write the respective GPR the value was read from or stored in itself. This approach is pretty prone to failure. What if we accidentally implement mfspr emulation where we just do "break" and nothing else? Suddenly we would get a random value in the return register - which is always a bad idea. So let's consolidate the generic code paths and only give the core specific SPR handling code readily made variables to read/write from/to. Functionally, this patch doesn't change anything, but it increases the readability of the code and makes is less prone to bugs. Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
5b74716e |
|
26-Apr-2012 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
kvm/powerpc: Add new ioctl to retreive server MMU infos This is necessary for qemu to be able to pass the right information to the guest, such as the supported page sizes and corresponding encodings in the SLB and hash table, which can vary depending on the processor type, the type of KVM used (PR vs HV) and the version of KVM Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [agraf: fix compilation on hv, adjust for newer ioctl numbers] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
f31e65e1 |
|
15-Mar-2012 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM There is nothing in the code for emulating TCE tables in the kernel that prevents it from working on "PR" KVM... other than ifdef's and location of the code. This and moves the bulk of the code there to a new file called book3s_64_vio.c. This speeds things up a bit on my G5. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [agraf: fix for hv kvm, 32bit, whitespace] Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
0456ec4f |
|
02-Feb-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Report stolen time to guest through dispatch trace log This adds code to measure "stolen" time per virtual core in units of timebase ticks, and to report the stolen time to the guest using the dispatch trace log (DTL). The guest can register an area of memory for the DTL for a given vcpu. The DTL is a ring buffer where KVM fills in one entry every time it enters the guest for that vcpu. Stolen time is measured as time when the virtual core is not running, either because the vcore is not runnable (e.g. some of its vcpus are executing elsewhere in the kernel or in userspace), or when the vcpu thread that is running the vcore is preempted. This includes time when all the vcpus are idle (i.e. have executed the H_CEDE hypercall), which is OK because the guest accounts stolen time while idle as idle time. Each vcpu keeps a record of how much stolen time has been reported to the guest for that vcpu so far. When we are about to enter the guest, we create a new DTL entry (if the guest vcpu has a DTL) and report the difference between total stolen time for the vcore and stolen time reported so far for the vcpu as the "enqueue to dispatch" time in the DTL entry. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
2e25aa5f |
|
19-Feb-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Make virtual processor area registration more robust The PAPR API allows three sorts of per-virtual-processor areas to be registered (VPA, SLB shadow buffer, and dispatch trace log), and furthermore, these can be registered and unregistered for another virtual CPU. Currently we just update the vcpu fields pointing to these areas at the time of registration or unregistration. If this is done on another vcpu, there is the possibility that the target vcpu is using those fields at the time and could end up using a bogus pointer and corrupting memory. This fixes the race by making the target cpu itself do the update, so we can be sure that the update happens at a time when the fields aren't being used. Each area now has a struct kvmppc_vpa which is used to manage these updates. There is also a spinlock which protects access to all of the kvmppc_vpa structs, other than to the pinned_addr fields. (We could have just taken the spinlock when using the vpa, slb_shadow or dtl fields, but that would mean taking the spinlock on every guest entry and exit.) This also changes 'struct dtl' (which was undefined) to 'struct dtl_entry', which is what the rest of the kernel uses. Thanks to Michael Ellerman <michael@ellerman.id.au> for pointing out the need to initialize vcpu->arch.vpa_update_lock. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
f0888f70 |
|
02-Feb-2012 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3S HV: Make secondary threads more robust against stray IPIs Currently on POWER7, if we are running the guest on a core and we don't need all the hardware threads, we do nothing to ensure that the unused threads aren't executing in the kernel (other than checking that they are offline). We just assume they're napping and we don't do anything to stop them trying to enter the kernel while the guest is running. This means that a stray IPI can wake up the hardware thread and it will then try to enter the kernel, but since the core is in guest context, it will execute code from the guest in hypervisor mode once it turns the MMU on, which tends to lead to crashes or hangs in the host. This fixes the problem by adding two new one-byte flags in the kvmppc_host_state structure in the PACA which are used to interlock between the primary thread and the unused secondary threads when entering the guest. With these flags, the primary thread can ensure that the unused secondaries are not already in kernel mode (i.e. handling a stray IPI) and then indicate that they should not try to enter the kernel if they do get woken for any reason. Instead they will go into KVM code, find that there is no vcpu to run, acknowledge and clear the IPI and go back to nap mode. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
ae3a197e |
|
28-Mar-2012 |
David Howells <dhowells@redhat.com> |
Disintegrate asm/system.h for PowerPC Disintegrate asm/system.h for PowerPC. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> cc: linuxppc-dev@lists.ozlabs.org
|
#
9cc815e4 |
|
16-Feb-2012 |
Danny Kukawka <danny.kukawka@bisect.de> |
arch/powerpc/kvm/book3s_hv.c: included linux/sched.h twice arch/powerpc/kvm/book3s_hv.c: included 'linux/sched.h' twice, remove the duplicate. Signed-off-by: Danny Kukawka <danny.kukawka@bisect.de> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
b4e70611 |
|
16-Jan-2012 |
Alexander Graf <agraf@suse.de> |
KVM: PPC: Convert RMA allocation into generic code We have code to allocate big chunks of linear memory on bootup for later use. This code is currently used for RMA allocation, but can be useful beyond that extent. Make it generic so we can reuse it for other stuff later. Signed-off-by: Alexander Graf <agraf@suse.de> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
31f3438e |
|
11-Dec-2011 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Move kvm_vcpu_ioctl_[gs]et_one_reg down to platform-specific code This moves the get/set_one_reg implementation down from powerpc.c into booke.c, book3s_pr.c and book3s_hv.c. This avoids #ifdefs in C code, but more importantly, it fixes a bug on Book3s HV where we were accessing beyond the end of the kvm_vcpu struct (via the to_book3s() macro) and corrupting memory, causing random crashes and file corruption. On Book3s HV we only accept setting the HIOR to zero, since the guest runs in supervisor mode and its vectors are never offset from zero. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> [agraf update to apply on top of changed ONE_REG patches] Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
6b75e6bf |
|
07-Dec-2011 |
Sasha Levin <levinsasha928@gmail.com> |
KVM: PPC: Use the vcpu kmem_cache when allocating new VCPUs Currently the code kzalloc()s new VCPUs instead of using the kmem_cache which is created when KVM is initialized. Modify it to allocate VCPUs from that kmem_cache. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
82ed3616 |
|
14-Dec-2011 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Book3s HV: Implement get_dirty_log using hardware changed bit This changes the implementation of kvm_vm_ioctl_get_dirty_log() for Book3s HV guests to use the hardware C (changed) bits in the guest hashed page table. Since this makes the implementation quite different from the Book3s PR case, this moves the existing implementation from book3s.c to book3s_pr.c and creates a new implementation in book3s_hv.c. That implementation calls kvmppc_hv_get_dirty_log() to do the actual work by calling kvm_test_clear_dirty on each page. It iterates over the HPTEs, clearing the C bit if set, and returns 1 if any C bit was set (including the saved C bit in the rmap entry). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
342d3db7 |
|
11-Dec-2011 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Implement MMU notifiers for Book3S HV guests This adds the infrastructure to enable us to page out pages underneath a Book3S HV guest, on processors that support virtualized partition memory, that is, POWER7. Instead of pinning all the guest's pages, we now look in the host userspace Linux page tables to find the mapping for a given guest page. Then, if the userspace Linux PTE gets invalidated, kvm_unmap_hva() gets called for that address, and we replace all the guest HPTEs that refer to that page with absent HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit set, which will cause an HDSI when the guest tries to access them. Finally, the page fault handler is extended to reinstantiate the guest HPTE when the guest tries to access a page which has been paged out. Since we can't intercept the guest DSI and ISI interrupts on PPC970, we still have to pin all the guest pages on PPC970. We have a new flag, kvm->arch.using_mmu_notifiers, that indicates whether we can page guest pages out. If it is not set, the MMU notifier callbacks do nothing and everything operates as before. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
697d3899 |
|
11-Dec-2011 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Implement MMIO emulation support for Book3S HV guests This provides the low-level support for MMIO emulation in Book3S HV guests. When the guest tries to map a page which is not covered by any memslot, that page is taken to be an MMIO emulation page. Instead of inserting a valid HPTE, we insert an HPTE that has the valid bit clear but another hypervisor software-use bit set, which we call HPTE_V_ABSENT, to indicate that this is an absent page. An absent page is treated much like a valid page as far as guest hcalls (H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that an absent HPTE doesn't need to be invalidated with tlbie since it was never valid as far as the hardware is concerned. When the guest accesses a page for which there is an absent HPTE, it will take a hypervisor data storage interrupt (HDSI) since we now set the VPM1 bit in the LPCR. Our HDSI handler for HPTE-not-present faults looks up the hash table and if it finds an absent HPTE mapping the requested virtual address, will switch to kernel mode and handle the fault in kvmppc_book3s_hv_page_fault(), which at present just calls kvmppc_hv_emulate_mmio() to set up the MMIO emulation. This is based on an earlier patch by Benjamin Herrenschmidt, but since heavily reworked. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
da9d1d7f |
|
11-Dec-2011 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Allow use of small pages to back Book3S HV guests This relaxes the requirement that the guest memory be provided as 16MB huge pages, allowing it to be provided as normal memory, i.e. in pages of PAGE_SIZE bytes (4k or 64k). To allow this, we index the kvm->arch.slot_phys[] arrays with a small page index, even if huge pages are being used, and use the low-order 5 bits of each entry to store the order of the enclosing page with respect to normal pages, i.e. log_2(enclosing_page_size / PAGE_SIZE). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
c77162de |
|
11-Dec-2011 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Only get pages when actually needed, not in prepare_memory_region() This removes the code from kvmppc_core_prepare_memory_region() that looked up the VMA for the region being added and called hva_to_page to get the pfns for the memory. We have no guarantee that there will be anything mapped there at the time of the KVM_SET_USER_MEMORY_REGION ioctl call; userspace can do that ioctl and then map memory into the region later. Instead we defer looking up the pfn for each memory page until it is needed, which generally means when the guest does an H_ENTER hcall on the page. Since we can't call get_user_pages in real mode, if we don't already have the pfn for the page, kvmppc_h_enter() will return H_TOO_HARD and we then call kvmppc_virtmode_h_enter() once we get back to kernel context. That calls kvmppc_get_guest_page() to get the pfn for the page, and then calls back to kvmppc_h_enter() to redo the HPTE insertion. When the first vcpu starts executing, we need to have the RMO or VRMA region mapped so that the guest's real mode accesses will work. Thus we now have a check in kvmppc_vcpu_run() to see if the RMO/VRMA is set up and if not, call kvmppc_hv_setup_rma(). It checks if the memslot starting at guest physical 0 now has RMO memory mapped there; if so it sets it up for the guest, otherwise on POWER7 it sets up the VRMA. The function that does that, kvmppc_map_vrma, is now a bit simpler, as it calls kvmppc_virtmode_h_enter instead of creating the HPTE itself. Since we are now potentially updating entries in the slot_phys[] arrays from multiple vcpu threads, we now have a spinlock protecting those updates to ensure that we don't lose track of any references to pages. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
93e60249 |
|
11-Dec-2011 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Add an interface for pinning guest pages in Book3s HV guests This adds two new functions, kvmppc_pin_guest_page() and kvmppc_unpin_guest_page(), and uses them to pin the guest pages where the guest has registered areas of memory for the hypervisor to update, (i.e. the per-cpu virtual processor areas, SLB shadow buffers and dispatch trace logs) and then unpin them when they are no longer required. Although it is not strictly necessary to pin the pages at this point, since all guest pages are already pinned, later commits in this series will mean that guest pages aren't all pinned. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
b2b2f165 |
|
11-Dec-2011 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Keep page physical addresses in per-slot arrays This allocates an array for each memory slot that is added to store the physical addresses of the pages in the slot. This array is vmalloc'd and accessed in kvmppc_h_enter using real_vmalloc_addr(). This allows us to remove the ram_pginfo field from the kvm_arch struct, and removes the 64GB guest RAM limit that we had. We use the low-order bits of the array entries to store a flag indicating that we have done get_page on the corresponding page, and therefore need to call put_page when we are finished with the page. Currently this is set for all pages except those in our special RMO regions. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
25051b5a |
|
08-Nov-2011 |
Scott Wood <scottwood@freescale.com> |
KVM: PPC: Move prepare_to_enter call site into subarch code This function should be called with interrupts disabled, to avoid a race where an exception is delivered after we check, but the resched kick is received before we disable interrupts (and thus doesn't actually trigger the exit code that would recheck exceptions). booke already does this properly in the lightweight exit case, but not on initial entry. For now, move the call of prepare_to_enter into subarch-specific code so that booke can do the right thing here. Ideally book3s would do the same thing, but I'm having a hard time seeing where it does any interrupt disabling of this sort (plus it has several additional call sites), so I'm deferring the book3s fix to someone more familiar with that code. book3s behavior should be unchanged by this patch. Signed-off-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
7e28e60e |
|
08-Nov-2011 |
Scott Wood <scottwood@freescale.com> |
KVM: PPC: Rename deliver_interrupts to prepare_to_enter This function also updates paravirt int_pending, so rename it to be more obvious that this is a collection of checks run prior to (re)entering a guest. Signed-off-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
ed7e3d1c |
|
15-Feb-2012 |
Danny Kukawka <danny.kukawka@bisect.de> |
arch/powerpc/kvm/book3s_hv.c: included linux/sched.h twice arch/powerpc/kvm/book3s_hv.c: included 'linux/sched.h' twice, remove the duplicate. Signed-off-by: Danny Kukawka <danny.kukawka@bisect.de> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
251da038 |
|
10-Nov-2011 |
Michael Neuling <mikey@neuling.org> |
KVM: PPC: fix kvmppc_start_thread() for CONFIG_SMP=N Currently kvmppc_start_thread() tries to wake other SMT threads via xics_wake_cpu(). Unfortunately xics_wake_cpu only exists when CONFIG_SMP=Y so when compiling with CONFIG_SMP=N we get: arch/powerpc/kvm/built-in.o: In function `.kvmppc_start_thread': book3s_hv.c:(.text+0xa1e0): undefined reference to `.xics_wake_cpu' The following should be fine since kvmppc_start_thread() shouldn't called to start non-zero threads when SMP=N since threads_per_core=1. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
de1d9248 |
|
09-Nov-2011 |
Michael Neuling <mikey@neuling.org> |
powerpc: Add hvcall.h include to book3s_hv.c If you build with KVM and UP it fails with the following due to a missing include. /arch/powerpc/kvm/book3s_hv.c: In function 'do_h_register_vpa': arch/powerpc/kvm/book3s_hv.c:156:10: error: 'H_PARAMETER' undeclared (first use in this function) arch/powerpc/kvm/book3s_hv.c:156:10: note: each undeclared identifier is reported only once for each function it appears in arch/powerpc/kvm/book3s_hv.c:192:12: error: 'H_RESOURCE' undeclared (first use in this function) arch/powerpc/kvm/book3s_hv.c:222:9: error: 'H_SUCCESS' undeclared (first use in this function) arch/powerpc/kvm/book3s_hv.c: In function 'kvmppc_pseries_do_hcall': arch/powerpc/kvm/book3s_hv.c:228:30: error: 'H_SUCCESS' undeclared (first use in this function) arch/powerpc/kvm/book3s_hv.c:232:7: error: 'H_CEDE' undeclared (first use in this function) arch/powerpc/kvm/book3s_hv.c:234:7: error: 'H_PROD' undeclared (first use in this function) arch/powerpc/kvm/book3s_hv.c:238:10: error: 'H_PARAMETER' undeclared (first use in this function) arch/powerpc/kvm/book3s_hv.c:250:7: error: 'H_CONFER' undeclared (first use in this function) arch/powerpc/kvm/book3s_hv.c:252:7: error: 'H_REGISTER_VPA' undeclared (first use in this function) make[2]: *** [arch/powerpc/kvm/book3s_hv.o] Error 1 Signed-off-by: Michael Neuling <mikey@neuling.org> cc: stable@kernel.org (3.1 only) Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
66b15db6 |
|
27-May-2011 |
Paul Gortmaker <paul.gortmaker@windriver.com> |
powerpc: add export.h to files making use of EXPORT_SYMBOL With module.h being implicitly everywhere via device.h, the absence of explicitly including something for EXPORT_SYMBOL went unnoticed. Since we are heading to fix things up and clean module.h from the device.h file, we need to explicitly include these files now. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
#
19ccb76a |
|
23-Jul-2011 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per core), whenever a CPU goes idle, we have to pull all the other hardware threads in the core out of the guest, because the H_CEDE hcall is handled in the kernel. This is inefficient. This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall in real mode. When a guest vcpu does an H_CEDE hcall, we now only exit to the kernel if all the other vcpus in the same core are also idle. Otherwise we mark this vcpu as napping, save state that could be lost in nap mode (mainly GPRs and FPRs), and execute the nap instruction. When the thread wakes up, because of a decrementer or external interrupt, we come back in at kvm_start_guest (from the system reset interrupt vector), find the `napping' flag set in the paca, and go to the resume path. This has some other ramifications. First, when starting a core, we now start all the threads, both those that are immediately runnable and those that are idle. This is so that we don't have to pull all the threads out of the guest when an idle thread gets a decrementer interrupt and wants to start running. In fact the idle threads will all start with the H_CEDE hcall returning; being idle they will just do another H_CEDE immediately and go to nap mode. This required some changes to kvmppc_run_core() and kvmppc_run_vcpu(). These functions have been restructured to make them simpler and clearer. We introduce a level of indirection in the wait queue that gets woken when external and decrementer interrupts get generated for a vcpu, so that we can have the 4 vcpus in a vcore using the same wait queue. We need this because the 4 vcpus are being handled by one thread. Secondly, when we need to exit from the guest to the kernel, we now have to generate an IPI for any napping threads, because an HDEC interrupt doesn't wake up a napping thread. Thirdly, we now need to be able to handle virtual external interrupts and decrementer interrupts becoming pending while a thread is napping, and deliver those interrupts to the guest when the thread wakes. This is done in kvmppc_cede_reentry, just before fast_guest_return. Finally, since we are not using the generic kvm_vcpu_block for book3s_hv, and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef from kvm_arch_vcpu_runnable. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
af8f38b3 |
|
10-Aug-2011 |
Alexander Graf <agraf@suse.de> |
KVM: PPC: Add sanity checking to vcpu_run There are multiple features in PowerPC KVM that can now be enabled depending on the user's wishes. Some of the combinations don't make sense or don't work though. So this patch adds a way to check if the executing environment would actually be able to run the guest properly. It also adds sanity checks if PVR is set (should always be true given the current code flow), if PAPR is only used with book3s_64 where it works and that HV KVM is only used in PAPR mode. Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
9e368f29 |
|
28-Jun-2011 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: book3s_hv: Add support for PPC970-family processors This adds support for running KVM guests in supervisor mode on those PPC970 processors that have a usable hypervisor mode. Unfortunately, Apple G5 machines have supervisor mode disabled (MSR[HV] is forced to 1), but the YDL PowerStation does have a usable hypervisor mode. There are several differences between the PPC970 and POWER7 in how guests are managed. These differences are accommodated using the CPU_FTR_ARCH_201 (PPC970) and CPU_FTR_ARCH_206 (POWER7) CPU feature bits. Notably, on PPC970: * The LPCR, LPID or RMOR registers don't exist, and the functions of those registers are provided by bits in HID4 and one bit in HID0. * External interrupts can be directed to the hypervisor, but unlike POWER7 they are masked by MSR[EE] in non-hypervisor modes and use SRR0/1 not HSRR0/1. * There is no virtual RMA (VRMA) mode; the guest must use an RMO (real mode offset) area. * The TLB entries are not tagged with the LPID, so it is necessary to flush the whole TLB on partition switch. Furthermore, when switching partitions we have to ensure that no other CPU is executing the tlbie or tlbsync instructions in either the old or the new partition, otherwise undefined behaviour can occur. * The PMU has 8 counters (PMC registers) rather than 6. * The DSCR, PURR, SPURR, AMR, AMOR, UAMOR registers don't exist. * The SLB has 64 entries rather than 32. * There is no mediated external interrupt facility, so if we switch to a guest that has a virtual external interrupt pending but the guest has MSR[EE] = 0, we have to arrange to have an interrupt pending for it so that we can get control back once it re-enables interrupts. We do that by sending ourselves an IPI with smp_send_reschedule after hard-disabling interrupts. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
969391c5 |
|
28-Jun-2011 |
Paul Mackerras <paulus@samba.org> |
powerpc, KVM: Split HVMODE_206 cpu feature bit into separate HV and architecture bits This replaces the single CPU_FTR_HVMODE_206 bit with two bits, one to indicate that we have a usable hypervisor mode, and another to indicate that the processor conforms to PowerISA version 2.06. We also add another bit to indicate that the processor conforms to ISA version 2.01 and set that for PPC970 and derivatives. Some PPC970 chips (specifically those in Apple machines) have a hypervisor mode in that MSR[HV] is always 1, but the hypervisor mode is not useful in the sense that there is no way to run any code in supervisor mode (HV=0 PR=0). On these processors, the LPES0 and LPES1 bits in HID4 are always 0, and we use that as a way of detecting that hypervisor mode is not useful. Where we have a feature section in assembly code around code that only applies on POWER7 in hypervisor mode, we use a construct like END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) The definition of END_FTR_SECTION_IFSET is such that the code will be enabled (not overwritten with nops) only if all bits in the provided mask are set. Note that the CPU feature check in __tlbie() only needs to check the ARCH_206 bit, not the HVMODE bit, because __tlbie() can only get called if we are running bare-metal, i.e. in hypervisor mode. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
aa04b4cc |
|
28-Jun-2011 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests This adds infrastructure which will be needed to allow book3s_hv KVM to run on older POWER processors, including PPC970, which don't support the Virtual Real Mode Area (VRMA) facility, but only the Real Mode Offset (RMO) facility. These processors require a physically contiguous, aligned area of memory for each guest. When the guest does an access in real mode (MMU off), the address is compared against a limit value, and if it is lower, the address is ORed with an offset value (from the Real Mode Offset Register (RMOR)) and the result becomes the real address for the access. The size of the RMA has to be one of a set of supported values, which usually includes 64MB, 128MB, 256MB and some larger powers of 2. Since we are unlikely to be able to allocate 64MB or more of physically contiguous memory after the kernel has been running for a while, we allocate a pool of RMAs at boot time using the bootmem allocator. The size and number of the RMAs can be set using the kvm_rma_size=xx and kvm_rma_count=xx kernel command line options. KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability of the pool of preallocated RMAs. The capability value is 1 if the processor can use an RMA but doesn't require one (because it supports the VRMA facility), or 2 if the processor requires an RMA for each guest. This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the pool and returns a file descriptor which can be used to map the RMA. It also returns the size of the RMA in the argument structure. Having an RMA means we will get multiple KMV_SET_USER_MEMORY_REGION ioctl calls from userspace. To cope with this, we now preallocate the kvm->arch.ram_pginfo array when the VM is created with a size sufficient for up to 64GB of guest memory. Subsequently we will get rid of this array and use memory associated with each memslot instead. This moves most of the code that translates the user addresses into host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level to kvmppc_core_prepare_memory_region. Also, instead of having to look up the VMA for each page in order to check the page size, we now check that the pages we get are compound pages of 16MB. However, if we are adding memory that is mapped to an RMA, we don't bother with calling get_user_pages_fast and instead just offset from the base pfn for the RMA. Typically the RMA gets added after vcpus are created, which makes it inconvenient to have the LPCR (logical partition control register) value in the vcpu->arch struct, since the LPCR controls whether the processor uses RMA or VRMA for the guest. This moves the LPCR value into the kvm->arch struct and arranges for the MER (mediated external request) bit, which is the only bit that varies between vcpus, to be set in assembly code when going into the guest if there is a pending external interrupt request. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
371fefd6 |
|
28-Jun-2011 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Allow book3s_hv guests to use SMT processor modes This lifts the restriction that book3s_hv guests can only run one hardware thread per core, and allows them to use up to 4 threads per core on POWER7. The host still has to run single-threaded. This capability is advertised to qemu through a new KVM_CAP_PPC_SMT capability. The return value of the ioctl querying this capability is the number of vcpus per virtual CPU core (vcore), currently 4. To use this, the host kernel should be booted with all threads active, and then all the secondary threads should be offlined. This will put the secondary threads into nap mode. KVM will then wake them from nap mode and use them for running guest code (while they are still offline). To wake the secondary threads, we send them an IPI using a new xics_wake_cpu() function, implemented in arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage we assume that the platform has a XICS interrupt controller and we are using icp-native.c to drive it. Since the woken thread will need to acknowledge and clear the IPI, we also export the base physical address of the XICS registers using kvmppc_set_xics_phys() for use in the low-level KVM book3s code. When a vcpu is created, it is assigned to a virtual CPU core. The vcore number is obtained by dividing the vcpu number by the number of threads per core in the host. This number is exported to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes to run the guest in single-threaded mode, it should make all vcpu numbers be multiples of the number of threads per core. We distinguish three states of a vcpu: runnable (i.e., ready to execute the guest), blocked (that is, idle), and busy in host. We currently implement a policy that the vcore can run only when all its threads are runnable or blocked. This way, if a vcpu needs to execute elsewhere in the kernel or in qemu, it can do so without being starved of CPU by the other vcpus. When a vcore starts to run, it executes in the context of one of the vcpu threads. The other vcpu threads all go to sleep and stay asleep until something happens requiring the vcpu thread to return to qemu, or to wake up to run the vcore (this can happen when another vcpu thread goes from busy in host state to blocked). It can happen that a vcpu goes from blocked to runnable state (e.g. because of an interrupt), and the vcore it belongs to is already running. In that case it can start to run immediately as long as the none of the vcpus in the vcore have started to exit the guest. We send the next free thread in the vcore an IPI to get it to start to execute the guest. It synchronizes with the other threads via the vcore->entry_exit_count field to make sure that it doesn't go into the guest if the other vcpus are exiting by the time that it is ready to actually enter the guest. Note that there is no fixed relationship between the hardware thread number and the vcpu number. Hardware threads are assigned to vcpus as they become runnable, so we will always use the lower-numbered hardware threads in preference to higher-numbered threads if not all the vcpus in the vcore are runnable, regardless of which vcpus are runnable. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
54738c09 |
|
28-Jun-2011 |
David Gibson <dwg@au1.ibm.com> |
KVM: PPC: Accelerate H_PUT_TCE by implementing it in real mode This improves I/O performance for guests using the PAPR paravirtualization interface by making the H_PUT_TCE hcall faster, by implementing it in real mode. H_PUT_TCE is used for updating virtual IOMMU tables, and is used both for virtual I/O and for real I/O in the PAPR interface. Since this moves the IOMMU tables into the kernel, we define a new KVM_CREATE_SPAPR_TCE ioctl to allow qemu to create the tables. The ioctl returns a file descriptor which can be used to mmap the newly created table. The qemu driver models use them in the same way as userspace managed tables, but they can be updated directly by the guest with a real-mode H_PUT_TCE implementation, reducing the number of host/guest context switches during guest IO. There are certain circumstances where it is useful for userland qemu to write to the TCE table even if the kernel H_PUT_TCE path is used most of the time. Specifically, allowing this will avoid awkwardness when we need to reset the table. More importantly, we will in the future need to write the table in order to restore its state after a checkpoint resume or migration. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
a8606e20 |
|
28-Jun-2011 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Handle some PAPR hcalls in the kernel This adds the infrastructure for handling PAPR hcalls in the kernel, either early in the guest exit path while we are still in real mode, or later once the MMU has been turned back on and we are in the full kernel context. The advantage of handling hcalls in real mode if possible is that we avoid two partition switches -- and this will become more important when we support SMT4 guests, since a partition switch means we have to pull all of the threads in the core out of the guest. The disadvantage is that we can only access the kernel linear mapping, not anything vmalloced or ioremapped, since the MMU is off. This also adds code to handle the following hcalls in real mode: H_ENTER Add an HPTE to the hashed page table H_REMOVE Remove an HPTE from the hashed page table H_READ Read HPTEs from the hashed page table H_PROTECT Change the protection bits in an HPTE H_BULK_REMOVE Remove up to 4 HPTEs from the hashed page table H_SET_DABR Set the data address breakpoint register Plus code to handle the following hcalls in the kernel: H_CEDE Idle the vcpu until an interrupt or H_PROD hcall arrives H_PROD Wake up a ceded vcpu H_REGISTER_VPA Register a virtual processor area (VPA) The code that runs in real mode has to be in the base kernel, not in the module, if KVM is compiled as a module. The real-mode code can only access the kernel linear mapping, not vmalloc or ioremap space. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|
#
de56a948 |
|
28-Jun-2011 |
Paul Mackerras <paulus@samba.org> |
KVM: PPC: Add support for Book3S processors in hypervisor mode This adds support for KVM running on 64-bit Book 3S processors, specifically POWER7, in hypervisor mode. Using hypervisor mode means that the guest can use the processor's supervisor mode. That means that the guest can execute privileged instructions and access privileged registers itself without trapping to the host. This gives excellent performance, but does mean that KVM cannot emulate a processor architecture other than the one that the hardware implements. This code assumes that the guest is running paravirtualized using the PAPR (Power Architecture Platform Requirements) interface, which is the interface that IBM's PowerVM hypervisor uses. That means that existing Linux distributions that run on IBM pSeries machines will also run under KVM without modification. In order to communicate the PAPR hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code to include/linux/kvm.h. Currently the choice between book3s_hv support and book3s_pr support (i.e. the existing code, which runs the guest in user mode) has to be made at kernel configuration time, so a given kernel binary can only do one or the other. This new book3s_hv code doesn't support MMIO emulation at present. Since we are running paravirtualized guests, this isn't a serious restriction. With the guest running in supervisor mode, most exceptions go straight to the guest. We will never get data or instruction storage or segment interrupts, alignment interrupts, decrementer interrupts, program interrupts, single-step interrupts, etc., coming to the hypervisor from the guest. Therefore this introduces a new KVMTEST_NONHV macro for the exception entry path so that we don't have to do the KVM test on entry to those exception handlers. We do however get hypervisor decrementer, hypervisor data storage, hypervisor instruction storage, and hypervisor emulation assist interrupts, so we have to handle those. In hypervisor mode, real-mode accesses can access all of RAM, not just a limited amount. Therefore we put all the guest state in the vcpu.arch and use the shadow_vcpu in the PACA only for temporary scratch space. We allocate the vcpu with kzalloc rather than vzalloc, and we don't use anything in the kvmppc_vcpu_book3s struct, so we don't allocate it. We don't have a shared page with the guest, but we still need a kvm_vcpu_arch_shared struct to store the values of various registers, so we include one in the vcpu_arch struct. The POWER7 processor has a restriction that all threads in a core have to be in the same partition. MMU-on kernel code counts as a partition (partition 0), so we have to do a partition switch on every entry to and exit from the guest. At present we require the host and guest to run in single-thread mode because of this hardware restriction. This code allocates a hashed page table for the guest and initializes it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We require that the guest memory is allocated using 16MB huge pages, in order to simplify the low-level memory management. This also means that we can get away without tracking paging activity in the host for now, since huge pages can't be paged or swapped. This also adds a few new exports needed by the book3s_hv code. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
|