#
183bdd16 |
|
13-Sep-2023 |
Binbin Wu <binbin.wu@linux.intel.com> |
KVM: x86: Use KVM-governed feature framework to track "LAM enabled" Use the governed feature framework to track if Linear Address Masking (LAM) is "enabled", i.e. if LAM can be used by the guest. Using the framework to avoid the relative expensive call guest_cpuid_has() during cr3 and vmexit handling paths for LAM. No functional change intended. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-14-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
|
#
3098e6ec |
|
13-Sep-2023 |
Robert Hoo <robert.hu@linux.intel.com> |
KVM: x86: Virtualize LAM for user pointer Add support to allow guests to set the new CR3 control bits for Linear Address Masking (LAM) and add implementation to get untagged address for user pointers. LAM modifies the canonical check for 64-bit linear addresses, allowing software to use the masked/ignored address bits for metadata. Hardware masks off the metadata bits before using the linear addresses to access memory. LAM uses two new CR3 non-address bits, LAM_U48 (bit 62) and LAM_U57 (bit 61), to configure LAM for user pointers. LAM also changes VMENTER to allow both bits to be set in VMCS's HOST_CR3 and GUEST_CR3 for virtualization. When EPT is on, CR3 is not trapped by KVM and it's up to the guest to set any of the two LAM control bits. However, when EPT is off, the actual CR3 used by the guest is generated from the shadow MMU root which is different from the CR3 that is *set* by the guest, and KVM needs to manually apply any active control bits to VMCS's GUEST_CR3 based on the cached CR3 *seen* by the guest. KVM manually checks guest's CR3 to make sure it points to a valid guest physical address (i.e. to support smaller MAXPHYSADDR in the guest). Extend this check to allow the two LAM control bits to be set. After check, LAM bits of guest CR3 will be stripped off to extract guest physical address. In case of nested, for a guest which supports LAM, both VMCS12's HOST_CR3 and GUEST_CR3 are allowed to have the new LAM control bits set, i.e. when L0 enters L1 to emulate a VMEXIT from L2 to L1 or when L0 enters L2 directly. KVM also manually checks VMCS12's HOST_CR3 and GUEST_CR3 being valid physical address. Extend such check to allow the new LAM control bits too. Note, LAM doesn't have a global control bit to turn on/off LAM completely, but purely depends on hardware's CPUID to determine it can be enabled or not. That means, when EPT is on, even when KVM doesn't expose LAM to guest, the guest can still set LAM control bits in CR3 w/o causing problem. This is an unfortunate virtualization hole. KVM could choose to intercept CR3 in this case and inject fault but this would hurt performance when running a normal VM w/o LAM support. This is undesirable. Just choose to let the guest do such illegal thing as the worst case is guest being killed when KVM eventually find out such illegal behaviour and that the guest is misbehaving. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-12-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
|
#
1affe455 |
|
14-Jul-2023 |
Yan Zhao <yan.y.zhao@intel.com> |
KVM: x86/mmu: Add helpers to return if KVM honors guest MTRRs Add helpers to check if KVM honors guest MTRRs instead of open coding the logic in kvm_tdp_page_fault(). Future fixes and cleanups will also need to determine if KVM should honor guest MTRRs, e.g. for CR0.CD toggling and and non-coherent DMA transitions. Provide an inner helper, __kvm_mmu_honors_guest_mtrrs(), so that KVM can check if guest MTRRs were honored when stopping non-coherent DMA. Note, there is no need to explicitly check that TDP is enabled, KVM clears shadow_memtype_mask when TDP is disabled, i.e. it's non-zero if and only if EPT is enabled. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230714065006.20201-1-yan.y.zhao@intel.com Link: https://lore.kernel.org/r/20230714065043.20258-1-yan.y.zhao@intel.com [sean: squash into a one patch, drop explicit TDP check massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
|
#
93284446 |
|
28-Jul-2023 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Don't bounce through page-track mechanism for guest PTEs Don't use the generic page-track mechanism to handle writes to guest PTEs in KVM's MMU. KVM's MMU needs access to information that should not be exposed to external page-track users, e.g. KVM needs (for some definitions of "need") the vCPU to query the current paging mode, whereas external users, i.e. KVMGT, have no ties to the current vCPU and so should never need the vCPU. Moving away from the page-track mechanism will allow dropping use of the page-track mechanism for KVM's own MMU, and will also allow simplifying and cleaning up the page-track APIs. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
cf9f4c0e |
|
04-Apr-2023 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Refresh CR0.WP prior to checking for emulated permission faults Refresh the MMU's snapshot of the vCPU's CR0.WP prior to checking for permission faults when emulating a guest memory access and CR0.WP may be guest owned. If the guest toggles only CR0.WP and triggers emulation of a supervisor write, e.g. when KVM is emulating UMIP, KVM may consume a stale CR0.WP, i.e. use stale protection bits metadata. Note, KVM passes through CR0.WP if and only if EPT is enabled as CR0.WP is part of the MMU role for legacy shadow paging, and SVM (NPT) doesn't support per-bit interception controls for CR0. Don't bother checking for EPT vs. NPT as the "old == new" check will always be true under NPT, i.e. the only cost is the read of vcpu->arch.cr4 (SVM unconditionally grabs CR0 from the VMCB on VM-Exit). Reported-by: Mathias Krause <minipli@grsecurity.net> Link: https://lkml.kernel.org/r/677169b4-051f-fcae-756b-9a3e1bb9f8fe%40grsecurity.net Fixes: fb509f76acc8 ("KVM: VMX: Make CR0.WP a guest owned bit") Tested-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230405002608.418442-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
|
#
607475cf |
|
21-Mar-2023 |
Binbin Wu <binbin.wu@linux.intel.com> |
KVM: x86: Add helpers to query individual CR0/CR4 bits Add helpers to check if a specific CR0/CR4 bit is set to avoid a plethora of implicit casts from the "unsigned long" return of kvm_read_cr*_bits(), and to make each caller's intent more obvious. Defer converting helpers that do truly ugly casts from "unsigned long" to "int", e.g. is_pse(), to a future commit so that their conversion is more isolated. Opportunistically drop the superfluous pcid_enabled from kvm_set_cr3(); the local variable is used only once, immediately after its declaration. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20230322045824.22970-2-binbin.wu@linux.intel.com [sean: move "obvious" conversions to this commit, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
|
#
1f98f2bd |
|
21-Sep-2022 |
David Matlack <dmatlack@google.com> |
KVM: x86/mmu: Change tdp_mmu to a read-only parameter Change tdp_mmu to a read-only parameter and drop the per-vm tdp_mmu_enabled. For 32-bit KVM, make tdp_mmu_enabled a macro that is always false so that the compiler can continue omitting cals to the TDP MMU. The TDP MMU was introduced in 5.10 and has been enabled by default since 5.15. At this point there are no known functionality gaps between the TDP MMU and the shadow MMU, and the TDP MMU uses less memory and scales better with the number of vCPUs. In other words, there is no good reason to disable the TDP MMU on a live system. Purposely do not drop tdp_mmu=N support (i.e. do not force 64-bit KVM to always use the TDP MMU) since tdp_mmu=N is still used to get test coverage of KVM's shadow MMU TDP support, which is used in 32-bit KVM. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
3af15ff4 |
|
21-Sep-2022 |
David Matlack <dmatlack@google.com> |
KVM: x86/mmu: Change tdp_mmu to a read-only parameter Change tdp_mmu to a read-only parameter and drop the per-vm tdp_mmu_enabled. For 32-bit KVM, make tdp_mmu_enabled a macro that is always false so that the compiler can continue omitting cals to the TDP MMU. The TDP MMU was introduced in 5.10 and has been enabled by default since 5.15. At this point there are no known functionality gaps between the TDP MMU and the shadow MMU, and the TDP MMU uses less memory and scales better with the number of vCPUs. In other words, there is no good reason to disable the TDP MMU on a live system. Purposely do not drop tdp_mmu=N support (i.e. do not force 64-bit KVM to always use the TDP MMU) since tdp_mmu=N is still used to get test coverage of KVM's shadow MMU TDP support, which is used in 32-bit KVM. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
0c29397a |
|
03-Aug-2022 |
Sean Christopherson <seanjc@google.com> |
KVM: SVM: Disable SEV-ES support if MMIO caching is disable Disable SEV-ES if MMIO caching is disabled as SEV-ES relies on MMIO SPTEs generating #NPF(RSVD), which are reflected by the CPU into the guest as a #VC. With SEV-ES, the untrusted host, a.k.a. KVM, doesn't have access to the guest instruction stream or register state and so can't directly emulate in response to a #NPF on an emulated MMIO GPA. Disabling MMIO caching means guest accesses to emulated MMIO ranges cause #NPF(!PRESENT), and those flavors of #NPF cause automatic VM-Exits, not #VC. Adjust KVM's MMIO masks to account for the C-bit location prior to doing SEV(-ES) setup, and document that dependency between adjusting the MMIO SPTE mask and SEV(-ES) setup. Fixes: b09763da4dd8 ("KVM: x86/mmu: Add module param to disable MMIO caching (for testing)") Reported-by: Michael Roth <michael.roth@amd.com> Tested-by: Michael Roth <michael.roth@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220803224957.1285926-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
2ca3129e |
|
14-Jun-2022 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEs Separate the macros for KVM's shadow PTEs (SPTE) from guest 64-bit PTEs (PT64). SPTE and PT64 are _mostly_ the same, but the few differences are quite critical, e.g. *_BASE_ADDR_MASK must differentiate between host and guest physical address spaces, and SPTE_PERM_MASK (was PT64_PERM_MASK) is very much specific to SPTEs. Opportunistically (and temporarily) move most guest macros into paging.h to clearly associate them with shadow paging, and to ensure that they're not used as of this commit. A future patch will eliminate them entirely. Sadly, PT32_LEVEL_BITS is left behind in mmu_internal.h because it's needed for the quadrant calculation in kvm_mmu_get_page(). The quadrant calculation is hot enough (when using shadow paging with 32-bit guests) that adding a per-context helper is undesirable, and burying the computation in paging_tmpl.h with a forward declaration isn't exactly an improvement. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
42c88ff8 |
|
14-Jun-2022 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Dedup macros for computing various page table masks Provide common helper macros to generate various masks, shifts, etc... for 32-bit vs. 64-bit page tables. Only the inputs differ, the actual calculations are identical. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
b3fcdb04 |
|
14-Jun-2022 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Bury 32-bit PSE paging helpers in paging_tmpl.h Move a handful of one-off macros and helpers for 32-bit PSE paging into paging_tmpl.h and hide them behind "PTTYPE == 32". Under no circumstance should anything but 32-bit shadow paging care about PSE paging. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
3c5c3245 |
|
19-Apr-2022 |
Kai Huang <kai.huang@intel.com> |
KVM: VMX: Include MKTME KeyID bits in shadow_zero_check Intel MKTME KeyID bits (including Intel TDX private KeyID bits) should never be set to SPTE. Set shadow_me_value to 0 and shadow_me_mask to include all MKTME KeyID bits to include them to shadow_zero_check. Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <27bc10e97a3c0b58a4105ff9107448c190328239.1650363789.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
e54f1ff2 |
|
19-Apr-2022 |
Kai Huang <kai.huang@intel.com> |
KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask Intel Multi-Key Total Memory Encryption (MKTME) repurposes couple of high bits of physical address bits as 'KeyID' bits. Intel Trust Domain Extentions (TDX) further steals part of MKTME KeyID bits as TDX private KeyID bits. TDX private KeyID bits cannot be set in any mapping in the host kernel since they can only be accessed by software running inside a new CPU isolated mode. And unlike to AMD's SME, host kernel doesn't set any legacy MKTME KeyID bits to any mapping either. Therefore, it's not legitimate for KVM to set any KeyID bits in SPTE which maps guest memory. KVM maintains shadow_zero_check bits to represent which bits must be zero for SPTE which maps guest memory. MKTME KeyID bits should be set to shadow_zero_check. Currently, shadow_me_mask is used by AMD to set the sme_me_mask to SPTE, and shadow_me_shadow is excluded from shadow_zero_check. So initializing shadow_me_mask to represent all MKTME keyID bits doesn't work for VMX (as oppositely, they must be set to shadow_zero_check). Introduce a new 'shadow_me_value' to replace existing shadow_me_mask, and repurpose shadow_me_mask as 'all possible memory encryption bits'. The new schematic of them will be: - shadow_me_value: the memory encryption bit(s) that will be set to the SPTE (the original shadow_me_mask). - shadow_me_mask: all possible memory encryption bits (which is a super set of shadow_me_value). - For now, shadow_me_value is supposed to be set by SVM and VMX respectively, and it is a constant during KVM's life time. This perhaps doesn't fit MKTME but for now host kernel doesn't support it (and perhaps will never do). - Bits in shadow_me_mask are set to shadow_zero_check, except the bits in shadow_me_value. Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them. Replace shadow_me_mask with shadow_me_value in almost all code paths, except the one in PT64_PERM_MASK, which is used by need_remote_flush() to determine whether remote TLB flush is needed. This should still use shadow_me_mask as any encryption bit change should need a TLB flush. And for AMD, move initializing shadow_me_value/shadow_me_mask from kvm_mmu_reset_all_pte_masks() to svm_hardware_setup(). Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
8a009d5b |
|
22-Apr-2022 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Make all page fault handlers internal to the MMU Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all of the page fault handling collateral that was in mmu.h purely for the async #PF handler into mmu_internal.h, where it belongs. This will allow kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to expose those enums outside of the MMU. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
a972e29c |
|
10-Feb-2022 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: x86/mmu: replace shadow_root_level with root_role.level root_role.level is always the same value as shadow_level: - it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu - it's the level argument when going through kvm_init_shadow_ept_mmu - it's assigned directly from new_role.base.level when going through shadow_mmu_init_context Remove the duplication and get the level directly from the role. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
86931ff7 |
|
28-Apr-2022 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR. The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the guest, possibly with help from userspace, manages to coerce KVM into creating a SPTE for an "impossible" gfn, KVM will leak the associated shadow pages (page tables): WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57 kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm] Modules linked in: kvm_intel kvm irqbypass CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm] Call Trace: <TASK> kvm_arch_destroy_vm+0x130/0x1b0 [kvm] kvm_destroy_vm+0x162/0x2d0 [kvm] kvm_vm_release+0x1d/0x30 [kvm] __fput+0x82/0x240 task_work_run+0x5b/0x90 exit_to_user_mode_prepare+0xd2/0xe0 syscall_exit_to_user_mode+0x1d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> On bare metal, encountering an impossible gpa in the page fault path is well and truly impossible, barring CPU bugs, as the CPU will signal #PF during the gva=>gpa translation (or a similar failure when stuffing a physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR of the underlying hardware, in which case the hardware will not fault on the illegal-from-KVM's-perspective gpa. Alternatively, KVM could continue allowing the dodgy behavior and simply zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's a (minor) waste of cycles, and more importantly, KVM can't reasonably support impossible memslots when running on bare metal (or with an accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if KVM is running as a guest is not a safe option as the host isn't required to announce itself to the guest in any way, e.g. doesn't need to set the HYPERVISOR CPUID bit. A second alternative to disallowing the memslot behavior would be to disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That restriction is undesirable as there are legitimate use cases for doing so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous systems so that VMs can be migrated between hosts with different MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess. Note that any guest.MAXPHYADDR is valid with shadow paging, and it is even useful in order to test KVM with MAXPHYADDR=52 (i.e. without any reserved physical address bits). The now common kvm_mmu_max_gfn() is inclusive instead of exclusive. The memslot and TDP MMU code want an exclusive value, but the name implies the returned value is inclusive, and the MMIO path needs an inclusive check. Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU") Fixes: 524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Ben Gardon <bgardon@google.com> Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220428233416.2446833-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
4f4aa80e |
|
11-Mar-2022 |
Lai Jiangshan <jiangshan.ljs@antgroup.com> |
KVM: X86: Handle implicit supervisor access with SMAP There are two kinds of implicit supervisor access implicit supervisor access when CPL = 3 implicit supervisor access when CPL < 3 Current permission_fault() handles only the first kind for SMAP. But if the access is implicit when SMAP is on, data may not be read nor write from any user-mode address regardless the current CPL. So the second kind should be also supported. The first kind can be detect via CPL and access mode: if it is supervisor access and CPL = 3, it must be implicit supervisor access. But it is not possible to detect the second kind without extra information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS into @access. This extra information also works for the first kind, so the logic is changed to use this information for both cases. The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48 which is in the most significant 16 bits of u64 and less likely to be forced to change due to future hardware uses it. This patch removes the call to ->get_cpl() for access mode is determined by @access. Not only does it reduce a function call, but also remove confusions when the permission is checked for nested TDP. The nested TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing on it. The original code works just because it is always user walk for NPT and SMAP fault is not set for EPT in update_permission_bitmask. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
8873c143 |
|
11-Mar-2022 |
Lai Jiangshan <jiangshan.ljs@antgroup.com> |
KVM: X86: Rename variable smap to not_smap in permission_fault() Comments above the variable says the bit is set when SMAP is overridden or the same meaning in update_permission_bitmask(): it is not subjected to SMAP restriction. Renaming it to reflect the negative implication and make the code better readability. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
5b22bbe7 |
|
11-Mar-2022 |
Lai Jiangshan <jiangshan.ljs@antgroup.com> |
KVM: X86: Change the type of access u32 to u64 Change the type of access u32 to u64 for FNAME(walk_addr) and ->gva_to_gpa(). The kinds of accesses are usually combinations of UWX, and VMX/SVM's nested paging adds a new factor of access: is it an access for a guest page table or for a final guest physical address. And SMAP relies a factor for supervisor access: explicit or implicit. So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include all these information to do the walk. Although @access(u32) has enough bits to encode all the kinds, this patch extends it to u64: o Extra bits will be in the higher 32 bits, so that we can easily obtain the traditional access mode (UWX) by converting it to u32. o Reuse the value for the access kind defined by SVM's nested paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as @error_code in kvm_handle_page_fault(). Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
527d5cd7 |
|
25-Feb-2022 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped Zap only obsolete roots when responding to zapping a single root shadow page. Because KVM keeps root_count elevated when stuffing a previous root into its PGD cache, shadowing a 64-bit guest means that zapping any root causes all vCPUs to reload all roots, even if their current root is not affected by the zap. For many kernels, zapping a single root is a frequent operation, e.g. in Linux it happens whenever an mm is dropped, e.g. process exits, etc... Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220225182248.3812651-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
b9e5603c |
|
21-Feb-2022 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: x86: use struct kvm_mmu_root_info for mmu->root The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info. Use the struct to have more consistency between mmu->root and mmu->prev_roots. The patch is entirely search and replace except for cached_root_available, which does not need a temporary struct kvm_mmu_root_info anymore. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
d6174299 |
|
09-Feb-2022 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: x86: Reinitialize context if host userspace toggles EFER.LME While the guest runs, EFER.LME cannot change unless CR0.PG is clear, and therefore EFER.NX is the only bit that can affect the MMU role. However, set_efer accepts a host-initiated change to EFER.LME even with CR0.PG=1. In that case, the MMU has to be reset. Fixes: 11988499e62b ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes") Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
00610021 |
|
25-Jan-2022 |
David Matlack <dmatlack@google.com> |
KVM: x86/mmu: Move is_writable_pte() to spte.h Move is_writable_pte() close to the other functions that check writability information about SPTEs. While here opportunistically replace the open-coded bit arithmetic in check_spte_writable_invariants() with a call to is_writable_pte(). No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220125230518.1697048-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
cc022ae1 |
|
24-Nov-2021 |
Lai Jiangshan <laijs@linux.alibaba.com> |
KVM: X86: Add parameter huge_page_level to kvm_init_shadow_ept_mmu() The level of supported large page on nEPT affects the rsvds_bits_mask. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-8-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
c59a0f57 |
|
24-Nov-2021 |
Lai Jiangshan <laijs@linux.alibaba.com> |
KVM: X86: Remove mmu->translate_gpa Reduce an indirect function call (retpoline) and some intialization code. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
61b05a9f |
|
19-Oct-2021 |
Lai Jiangshan <laijs@linux.alibaba.com> |
KVM: X86: Don't unload MMU in kvm_vcpu_flush_tlb_guest() kvm_mmu_unload() destroys all the PGD caches. Use the lighter kvm_mmu_sync_roots() and kvm_mmu_sync_prev_roots() instead. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211019110154.4091-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
a91a7c70 |
|
18-Sep-2021 |
Lai Jiangshan <laijs@linux.alibaba.com> |
KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE X86_CR4_PGE doesn't participate in kvm_mmu_role, so the mmu context doesn't need to be reset. It is only required to flush all the guest tlb. It is also inconsistent that X86_CR4_PGE is in KVM_MMU_CR4_ROLE_BITS while kvm_mmu_role doesn't use X86_CR4_PGE. So X86_CR4_PGE is also removed from KVM_MMU_CR4_ROLE_BITS. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210919024246.89230-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
2839180c |
|
29-Sep-2021 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: x86/mmu: clean up prefetch/prefault/speculative naming "prefetch", "prefault" and "speculative" are used throughout KVM to mean the same thing. Use a single name, standardizing on "prefetch" which is already used by various functions such as direct_pte_prefetch, FNAME(prefetch_gpte), FNAME(pte_prefetch), etc. Suggested-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
1e76a3ce |
|
14-Oct-2021 |
David Stevens <stevensd@chromium.org> |
KVM: cleanup allocation of rmaps and page tracking data Unify the flags for rmaps and page tracking data, using a single flag in struct kvm_arch and a single loop to go over all the address spaces and memslots. This avoids code duplication between alloc_all_memslots_rmaps and kvm_page_track_enable_mmu_write_tracking. Signed-off-by: David Stevens <stevensd@chromium.org> [This patch is the delta between David's v2 and v3, with conflicts fixed and my own commit message. - Paolo] Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
e710c5f6 |
|
24-Sep-2021 |
David Matlack <dmatlack@google.com> |
KVM: x86/mmu: Pass the memslot around via struct kvm_page_fault The memslot for the faulting gfn is used throughout the page fault handling code, so capture it in kvm_page_fault as soon as we know the gfn and use it in the page fault handling code that has direct access to the kvm_page_fault struct. Replace various tests using is_noslot_pfn with more direct tests on fault->slot being NULL. This, in combination with the subsequent patch, improves "Populate memory time" in dirty_log_perf_test by 5% when using the legacy MMU. There is no discerable improvement to the performance of the TDP MMU. No functional change intended. Suggested-by: Ben Gardon <bgardon@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210813203504.2742757-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
73a3c659 |
|
07-Aug-2021 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: MMU: change kvm_mmu_hugepage_adjust() arguments to kvm_page_fault Pass struct kvm_page_fault to kvm_mmu_hugepage_adjust() instead of extracting the arguments from the struct; the results are also stored in the struct, so the callers are adjusted consequently. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
3647cd04 |
|
07-Aug-2021 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: MMU: change kvm_faultin_pfn() arguments to kvm_page_fault Add fields to struct kvm_page_fault corresponding to outputs of kvm_faultin_pfn(). For now they have to be extracted again from struct kvm_page_fault in the subsequent steps, but this is temporary until other functions in the chain are switched over as well. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
b8a5d551 |
|
06-Aug-2021 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: MMU: change page_fault_handle_page_track() arguments to kvm_page_fault Add fields to struct kvm_page_fault corresponding to the arguments of page_fault_handle_page_track(). The fields are initialized in the callers, and page_fault_handle_page_track() receives a struct kvm_page_fault instead of having to extract the arguments out of it. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
4326e57e |
|
06-Aug-2021 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: MMU: change direct_page_fault() arguments to kvm_page_fault Add fields to struct kvm_page_fault corresponding to the arguments of direct_page_fault(). The fields are initialized in the callers, and direct_page_fault() receives a struct kvm_page_fault instead of having to extract the arguments out of it. Also adjust FNAME(page_fault) to store the max_level in struct kvm_page_fault, to keep it similar to the direct map path. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
c501040a |
|
06-Aug-2021 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: MMU: change mmu->page_fault() arguments to kvm_page_fault Pass struct kvm_page_fault to mmu->page_fault() instead of extracting the arguments from the struct. FNAME(page_fault) can use the precomputed bools from the error code. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
6defd9bb |
|
06-Aug-2021 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: MMU: Introduce struct kvm_page_fault Create a single structure for arguments that are passed from kvm_mmu_do_page_fault to the page fault handlers. Later the structure will grow to include various output parameters that are passed back to the next steps in the page fault handling. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
71f51d2c |
|
02-Aug-2021 |
Mingwei Zhang <mizhang@google.com> |
KVM: x86/mmu: Add detailed page size stats Existing KVM code tracks the number of large pages regardless of their sizes. Therefore, when large page of 1GB (or larger) is adopted, the information becomes less useful because lpages counts a mix of 1G and 2M pages. So remove the lpages since it is easy for user space to aggregate the info. Instead, provide a comprehensive page stats of all sizes from 4K to 512G. Suggested-by: Ben Gardon <bgardon@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Cc: Jing Zhang <jingzhangos@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Sean Christopherson <seanjc@google.com> Message-Id: <20210803044607.599629-4-mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
4139b197 |
|
30-Jul-2021 |
Peter Xu <peterx@redhat.com> |
KVM: X86: Introduce kvm_mmu_slot_lpages() helpers Introduce kvm_mmu_slot_lpages() to calculcate lpage_info and rmap array size. The other __kvm_mmu_slot_lpages() can take an extra parameter of npages rather than fetching from the memslot pointer. Start to use the latter one in kvm_alloc_memslot_metadata(). Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20210730220455.26054-4-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
fdaa2935 |
|
22-Jun-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault Use the current MMU instead of vCPU state to query CR0.WP when handling a page fault. In the nested NPT case, the current CR0.WP reflects L2, whereas the page fault is shadowing L1's NPT. Practically speaking, this is a nop a NPT walks are always user faults, but fix it up for consistency. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-53-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
16be1d12 |
|
22-Jun-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Move nested NPT reserved bit calculation into MMU proper Move nested NPT's invocation of reset_shadow_zero_bits_mask() into the MMU proper and unexport said function. Aside from dropping an export, this is a baby step toward eliminating the call entirely by fixing the shadow_root_level confusion. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
20f632bd |
|
22-Jun-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: x86: Read and pass all CR0/CR4 role bits to shadow MMU helper Grab all CR0/CR4 MMU role bits from current vCPU state when initializing a non-nested shadow MMU. Extract the masks from kvm_post_set_cr{0,4}(), as the CR0/CR4 update masks must exactly match the mmu_role bits, with one exception (see below). The "full" CR0/CR4 will be used by future commits to initialize the MMU and its role, as opposed to the current approach of pulling everything from vCPU, which is incorrect for certain flows, e.g. nested NPT. CR4.LA57 is an exception, as it can be toggled on VM-Exit (for L1's MMU) but can't be toggled via MOV CR4 while long mode is active. I.e. LA57 needs to be in the mmu_role, but technically doesn't need to be checked by kvm_post_set_cr4(). However, the extra check is completely benign as the hardware restrictions simply mean LA57 will never be _the_ cause of a MMU reset during MOV CR4. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
dbc4739b |
|
22-Jun-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: x86: Fix sizes used to pass around CR0, CR4, and EFER When configuring KVM's MMU, pass CR0 and CR4 as unsigned longs, and EFER as a u64 in various flows (mostly MMU). Passing the params as u32s is functionally ok since all of the affected registers reserve bits 63:32 to zero (enforced by KVM), but it's technically wrong. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
c9060662 |
|
09-Jun-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: x86: Drop pointless @reset_roots from kvm_init_mmu() Remove the @reset_roots param from kvm_init_mmu(), the one user, kvm_mmu_reset_context() has already unloaded the MMU and thus freed and invalidated all roots. This also happens to be why the reset_roots=true paths doesn't leak roots; they're already invalid. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
d501f747 |
|
18-May-2021 |
Ben Gardon <bgardon@google.com> |
KVM: x86/mmu: Lazily allocate memslot rmaps If the TDP MMU is in use, wait to allocate the rmaps until the shadow MMU is actually used. (i.e. a nested VM is launched.) This saves memory equal to 0.2% of guest memory in cases where the TDP MMU is used and there are no nested guests involved. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210518173414.450044-8-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
e2209710 |
|
18-May-2021 |
Ben Gardon <bgardon@google.com> |
KVM: x86/mmu: Skip rmap operations if rmaps not allocated If only the TDP MMU is being used to manage the memory mappings for a VM, then many rmap operations can be skipped as they are guaranteed to be no-ops. This saves some time which would be spent on the rmap operation. It also avoids acquiring the MMU lock in write mode for many operations. This makes it safe to run the VM without rmaps allocated, when only using the TDP MMU and sets the stage for waiting to allocate the rmaps until they're needed. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210518173414.450044-7-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
e83bc09c |
|
05-Mar-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: x86: Get active PCID only when writing a CR3 value Retrieve the active PCID only when writing a guest CR3 value, i.e. don't get the PCID when using EPT or NPT. The PCID is especially problematic for EPT as the bits have different meaning, and so the PCID and must be manually stripped, which is annoying and unnecessary. And on VMX, getting the active PCID also involves reading the guest's CR3 and CR4.PCIDE, i.e. may add pointless VMREADs. Opportunistically rename the pgd/pgd_level params to root_hpa and root_level to better reflect their new roles. Keep the function names, as "load the guest PGD" is still accurate/correct. Last, and probably least, pass root_hpa as a hpa_t/u64 instead of an unsigned long. The EPTP holds a 64-bit value, even in 32-bit mode, so in theory EPT could support HIGHMEM for 32-bit KVM. Never mind that doing so would require changing the MMU page allocators and reworking the MMU to use kmap(). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210305183123.3978098-2-seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
5fc3424f |
|
25-Feb-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Make Host-writable and MMU-writable bit locations dynamic Make the location of the HOST_WRITABLE and MMU_WRITABLE configurable for a given KVM instance. This will allow EPT to use high available bits, which in turn will free up bit 11 for a constant MMU_PRESENT bit. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210225204749.1512652-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
e7b7bdea |
|
25-Feb-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Move logic for setting SPTE masks for EPT into the MMU proper Let the MMU deal with the SPTE masks to avoid splitting the logic and knowledge across the MMU and VMX. The SPTE masks that are used for EPT are very, very tightly coupled to the MMU implementation. The use of available bits, the existence of A/D types, the fact that shadow_x_mask even exists, and so on and so forth are all baked into the MMU implementation. Cross referencing the params to the masks is also a nightmare, as pretty much every param is a u64. A future patch will make the location of the MMU_WRITABLE and HOST_WRITABLE bits MMU specific, to free up bit 11 for a MMU_PRESENT bit. Doing that change with the current kvm_mmu_set_mask_ptes() would be an absolute mess. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210225204749.1512652-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
8120337a |
|
25-Feb-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Stop using software available bits to denote MMIO SPTEs Stop tagging MMIO SPTEs with specific available bits and instead detect MMIO SPTEs by checking for their unique SPTE value. The value is guaranteed to be unique on shadow paging and NPT as setting reserved physical address bits on any other type of SPTE would consistute a KVM bug. Ditto for EPT, as creating a WX non-MMIO would also be a bug. Note, this approach is also future-compatibile with TDX, which will need to reflect MMIO EPT violations as #VEs into the guest. To create an EPT violation instead of a misconfig, TDX EPTs will need to have RWX=0, But, MMIO SPTEs will also be the only case where KVM clears SUPPRESS_VE, so MMIO SPTEs will still be guaranteed to have a unique value within a given MMU context. The main motivation is to make it easier to reason about which types of SPTEs use which available bits. As a happy side effect, this frees up two more bits for storing the MMIO generation. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210225204749.1512652-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
61a1773e |
|
04-Mar-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Unexport MMU load/unload functions Unexport the MMU load and unload helpers now that they are no longer used (incorrectly) in vendor code. Opportunistically move the kvm_mmu_sync_roots() declaration into mmu.h, it should not be exposed to vendor code. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210305011101.3597423-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
b3646477 |
|
14-Jan-2021 |
Jason Baron <jbaron@akamai.com> |
KVM: x86: use static calls to reduce kvm_x86_ops overhead Convert kvm_x86_ops to use static calls. Note that all kvm_x86_ops are covered here except for 'pmu_ops and 'nested ops'. Here are some numbers running cpuid in a loop of 1 million calls averaged over 5 runs, measured in the vm (lower is better). Intel Xeon 3000MHz: |default |mitigations=off ------------------------------------- vanilla |.671s |.486s static call|.573s(-15%)|.458s(-6%) AMD EPYC 2500MHz: |default |mitigations=off ------------------------------------- vanilla |.710s |.609s static call|.664s(-6%) |.609s(0%) Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Jason Baron <jbaron@akamai.com> Message-Id: <e057bf1b8a7ad15652df6eeba3f907ae758d3399.1610680941.git.jbaron@akamai.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
15e6a7e5 |
|
22-Jan-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Use boolean returns for (S)PTE accessors Return a 'bool' instead of an 'int' for various PTE accessors that are boolean in nature, e.g. is_shadow_present_pte(). Returning an int is goofy and potentially dangerous, e.g. if a flag being checked is moved into the upper 32 bits of a SPTE, then the compiler may silently squash the entire check since casting to an int is guaranteed to yield a return value of '0'. Opportunistically refactor is_last_spte() so that it naturally returns a bool value instead of letting it implicitly cast 0/1 to false/true. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210123003003.3137525-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
eb79cd00 |
|
13-Jan-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: x86: Add more protection against undefined behavior in rsvd_bits() Add compile-time asserts in rsvd_bits() to guard against KVM passing in garbage hardcoded values, and cap the upper bound at '63' for dynamic values to prevent generating a mask that would overflow a u64. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210113204515.3473079-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
2f80d502 |
|
22-Dec-2020 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: x86: fix shift out of bounds reported by UBSAN Since we know that e >= s, we can reassociate the left shift, changing the shifted number from 1 to 2 in exchange for decreasing the right hand side by 1. Reported-by: syzbot+e87846c48bf72bc85311@syzkaller.appspotmail.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
dc46515c |
|
24-Sep-2020 |
Sean Christopherson <seanjc@google.com> |
KVM: x86: Move illegal GPA helper out of the MMU code Rename kvm_mmu_is_illegal_gpa() to kvm_vcpu_is_illegal_gpa() and move it to cpuid.h so that's it's colocated with cpuid_maxphyaddr(). The helper is not MMU specific and will gain a user that is completely unrelated to the MMU in a future patch. No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200924194250.19137-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
2a40b900 |
|
15-Jul-2020 |
Sean Christopherson <seanjc@google.com> |
KVM: x86: Pull the PGD's level from the MMU instead of recalculating it Use the shadow_root_level from the current MMU as the root level for the PGD, i.e. for VMX's EPTP. This eliminates the weird dependency between VMX and the MMU where both must independently calculate the same root level for things to work correctly. Temporarily keep VMX's calculation of the level and use it to WARN if the incoming level diverges. Opportunistically refactor kvm_mmu_load_pgd() to avoid indentation hell, and rename a 'cr3' param in the load_mmu_pgd prototype that managed to survive the cr3 purge. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
89786147 |
|
10-Jul-2020 |
Mohammed Gamal <mgamal@redhat.com> |
KVM: x86: Add helper functions for illegal GPA checking and page fault injection This patch adds two helper functions that will be used to support virtualizing MAXPHYADDR in both kvm-intel.ko and kvm.ko. kvm_fixup_and_inject_pf_error() injects a page fault for a user-specified GVA, while kvm_mmu_is_illegal_gpa() checks whether a GPA exceeds vCPU address limits. Signed-off-by: Mohammed Gamal <mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200710154811.418214-2-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
0f04a2ac |
|
10-Jul-2020 |
Vitaly Kuznetsov <vkuznets@redhat.com> |
KVM: nSVM: split kvm_init_shadow_npt_mmu() from kvm_init_shadow_mmu() As a preparatory change for moving kvm_mmu_new_pgd() from nested_prepare_vmcb_save() to nested_svm_init_mmu_context() split kvm_init_shadow_npt_mmu() from kvm_init_shadow_mmu(). This also makes the code look more like nVMX (kvm_init_shadow_ept_mmu()). No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200710141157.1640173-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
6ca9a6f3 |
|
22-Jun-2020 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Add MMU-internal header Add mmu/mmu_internal.h to hold declarations and definitions that need to be shared between various mmu/ files, but should not be used by anything outside of the MMU. Begin populating mmu_internal.h with declarations of the helpers used by page_track.c. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622202034.15093-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
afe8d7e6 |
|
22-Jun-2020 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Move kvm_mmu_available_pages() into mmu.c Move kvm_mmu_available_pages() from mmu.h to mmu.c, it has a single caller and has no business being exposed via mmu.h. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622202034.15093-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
f25a9dec |
|
22-Jun-2020 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Drop kvm_arch_write_log_dirty() wrapper Drop kvm_arch_write_log_dirty() in favor of invoking .write_log_dirty() directly from FNAME(update_accessed_dirty_bits). "kvm_arch" is usually used for x86 functions that are invoked from generic KVM, and implies that there are external callers, neither of which is true. Remove the check for a non-NULL kvm_x86_ops hook as the call is wrapped in PTTYPE_EPT and is unconditionally set by VMX. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
2dbebf7a |
|
22-Jun-2020 |
Sean Christopherson <seanjc@google.com> |
KVM: nVMX: Plumb L2 GPA through to PML emulation Explicitly pass the L2 GPA to kvm_arch_write_log_dirty(), which for all intents and purposes is vmx_write_pml_buffer(), instead of having the latter pull the GPA from vmcs.GUEST_PHYSICAL_ADDRESS. If the dirty bit update is the result of KVM emulation (rare for L2), then the GPA in the VMCS may be stale and/or hold a completely unrelated GPA. Fixes: c5f983f6e8455 ("nVMX: Implement emulated Page Modification Logging") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
929d1cfa |
|
19-May-2020 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: MMU: pass arbitrary CR0/CR4/EFER to kvm_init_shadow_mmu This allows fetching the registers from the hsave area when setting up the NPT shadow MMU, and is needed for KVM_SET_NESTED_STATE (which runs long after the CR0, CR4 and EFER values in vcpu have been switched to hold L2 guest state). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
e7581cac |
|
19-May-2020 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: x86: simplify is_mmio_spte We can simply look at bits 52-53 to identify MMIO entries in KVM's page tables. Therefore, there is no need to pass a mask to kvm_mmu_set_mmio_spte_mask. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
afaf0b2f |
|
21-Mar-2020 |
Sean Christopherson <seanjc@google.com> |
KVM: x86: Copy kvm_x86_ops by value to eliminate layer of indirection Replace the kvm_x86_ops pointer in common x86 with an instance of the struct to save one pointer dereference when invoking functions. Copy the struct by value to set the ops during kvm_init(). Arbitrarily use kvm_x86_ops.hardware_enable to track whether or not the ops have been initialized, i.e. a vendor KVM module has been loaded. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200321202603.19355-7-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
727a7e27 |
|
05-Mar-2020 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: x86: rename set_cr3 callback and related flags to load_mmu_pgd The set_cr3 callback is not setting the guest CR3, it is setting the root of the guest page tables, either shadow or two-dimensional. To make this clearer as well as to indicate that the MMU calls it via kvm_mmu_load_cr3, rename it to load_mmu_pgd. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
689f3bf2 |
|
03-Mar-2020 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: x86: unify callbacks to load paging root Similar to what kvm-intel.ko is doing, provide a single callback that merges svm_set_cr3, set_tdp_cr3 and nested_svm_set_tdp_cr3. This lets us unify the set_cr3 and set_tdp_cr3 entries in kvm_x86_ops. I'm doing that in this same patch because splitting it adds quite a bit of churn due to the need for forward declarations. For the same reason the assignment to vcpu->arch.mmu->set_cr3 is moved to kvm_init_shadow_mmu from init_kvm_softmmu and nested_svm_init_mmu_context. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
7a02674d |
|
06-Feb-2020 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Avoid retpoline on ->page_fault() with TDP Wrap calls to ->page_fault() with a small shim to directly invoke the TDP fault handler when the kernel is using retpolines and TDP is being used. Single out the TDP fault handler and annotate the TDP path as likely to coerce the compiler into preferring it over the indirect function call. Rename tdp_page_fault() to kvm_tdp_page_fault(), as it's exposed outside of mmu.c to allow inlining the shim. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
1aa9b957 |
|
04-Nov-2019 |
Junaid Shahid <junaids@google.com> |
kvm: x86: mmu: Recovery of shattered NX large pages The page table pages corresponding to broken down large pages are zapped in FIFO order, so that the large page can potentially be recovered, if it is not longer being used for execution. This removes the performance penalty for walking deeper EPT page tables. By default, one large page will last about one hour once the guest reaches a steady state. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
#
4af77151 |
|
01-Aug-2019 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/mmu: Add explicit access mask for MMIO SPTEs When shadow paging is enabled, KVM tracks the allowed access type for MMIO SPTEs so that it can do a permission check on a MMIO GVA cache hit without having to walk the guest's page tables. The tracking is done by retaining the WRITE and USER bits of the access when inserting the MMIO SPTE (read access is implicitly allowed), which allows the MMIO page fault handler to retrieve and cache the WRITE/USER bits from the SPTE. Unfortunately for EPT, the mask used to retain the WRITE/USER bits is hardcoded using the x86 paging versions of the bits. This funkiness happens to work because KVM uses a completely different mask/value for MMIO SPTEs when EPT is enabled, and the EPT mask/value just happens to overlap exactly with the x86 WRITE/USER bits[*]. Explicitly define the access mask for MMIO SPTEs to accurately reflect that EPT does not want to incorporate any access bits into the SPTE, and so that KVM isn't subtly relying on EPT's WX bits always being set in MMIO SPTEs, e.g. attempting to use other bits for experimentation breaks horribly. Note, vcpu_match_mmio_gva() explicits prevents matching GVA==0, and all TDP flows explicit set mmio_gva to 0, i.e. zeroing vcpu->arch.access for EPT has no (known) functional impact. [*] Using WX to generate EPT misconfigurations (equivalent to reserved bit page fault) ensures KVM can employ its MMIO page fault tricks even platforms without reserved address bits. Fixes: ce88decffd17 ("KVM: MMU: mmio page fault support") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
bc8a3d89 |
|
08-Apr-2019 |
Ben Gardon <bgardon@google.com> |
kvm: mmu: Fix overflow on kvm mmu page limit calculation KVM bases its memory usage limits on the total number of guest pages across all memslots. However, those limits, and the calculations to produce them, use 32 bit unsigned integers. This can result in overflow if a VM has more guest pages that can be represented by a u32. As a result of this overflow, KVM can use a low limit on the number of MMU pages it will allocate. This makes KVM unable to map all of guest memory at once, prompting spurious faults. Tested: Ran all kvm-unit-tests on an Intel Haswell machine. This patch introduced no new failures. Signed-off-by: Ben Gardon <bgardon@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
ea145aac |
|
05-Feb-2019 |
Sean Christopherson <seanjc@google.com> |
Revert "KVM: MMU: fast invalidate all pages" Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches from the original series[1], now that all users of the fast invalidate mechanism are gone. This reverts commit 5304b8d37c2a5ebca48330f5e7868d240eafbed1. [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
44dd3ffa |
|
08-Oct-2018 |
Vitaly Kuznetsov <vkuznets@redhat.com> |
x86/kvm/mmu: make vcpu->mmu a pointer to the current MMU As a preparation to full MMU split between L1 and L2 make vcpu->arch.mmu a pointer to the currently used mmu. For now, this is always vcpu->arch.root_mmu. No functional change. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
|
#
4fef0f49 |
|
28-Sep-2018 |
Wei Yang <richard.weiyang@gmail.com> |
KVM: x86: move definition PT_MAX_HUGEPAGE_LEVEL and KVM_NR_PAGE_SIZES together Currently, there are two definitions related to huge page, but a little bit far from each other and seems loosely connected: * KVM_NR_PAGE_SIZES defines the number of different size a page could map * PT_MAX_HUGEPAGE_LEVEL means the maximum level of huge page The number of different size a page could map equals the maximum level of huge page, which is implied by current definition. While current implementation may not be kind to readers and further developers: * KVM_NR_PAGE_SIZES looks like a stand alone definition at first sight * in case we need to support more level, two places need to change This patch tries to make these two definition more close, so that reader and developer would feel more comfortable to manipulate. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
c9470a2e |
|
27-Jun-2018 |
Junaid Shahid <junaids@google.com> |
kvm: x86: Propagate guest PCIDs to host PCIDs When using shadow paging mode, propagate the guest's PCID value to the shadow CR3 in the host instead of always using PCID 0. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
afe828d1 |
|
27-Jun-2018 |
Junaid Shahid <junaids@google.com> |
kvm: x86: Add ability to skip TLB flush when switching CR3 Remove the implicit flush from the set_cr3 handlers, so that the callers are able to decide whether to flush the TLB or not. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
50c28f21 |
|
27-Jun-2018 |
Junaid Shahid <junaids@google.com> |
kvm: x86: Use fast CR3 switch for nested VMX Use the fast CR3 switch mechanism to locklessly change the MMU root page when switching between L1 and L2. The switch from L2 to L1 should always go through the fast path, while the switch from L1 to L2 should go through the fast path if L1's CR3/EPTP for L2 hasn't changed since the last time. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
1c53da3f |
|
27-Jun-2018 |
Junaid Shahid <junaids@google.com> |
kvm: x86: Support resetting the MMU context without resetting roots This adds support for re-initializing the MMU context in a different mode while preserving the active root_hpa and the prev_root. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
6e42782f |
|
27-Jun-2018 |
Junaid Shahid <junaids@google.com> |
kvm: x86: Introduce KVM_REQ_LOAD_CR3 The KVM_REQ_LOAD_CR3 request loads the hardware CR3 using the current root_hpa. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
b2441318 |
|
01-Nov-2017 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
d0006530 |
|
11-Aug-2017 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: SVM: limit kvm_handle_page_fault to #PF handling It has always annoyed me a bit how SVM_EXIT_NPF is handled by pf_interception. This is also the only reason behind the under-documented need_unprotect argument to kvm_handle_page_fault. Let NPF go straight to kvm_mmu_page_fault, just like VMX does in handle_ept_violation and handle_ept_misconfig. Reviewed-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
ea2800dd |
|
25-Aug-2017 |
Brijesh Singh <brijesh.singh@amd.com> |
kvm/x86: Avoid clearing the C-bit in rsvd_bits() The following commit: d0ec49d4de90 ("kvm/x86/svm: Support Secure Memory Encryption within KVM") uses __sme_clr() to remove the C-bit in rsvd_bits(). rsvd_bits() is just a simple function to return some 1 bits. Applying a mask based on properties of the host MMU is incorrect. Additionally, the masks computed by __reset_rsvds_bits_mask also apply to guest page tables, where the C bit is reserved since we don't emulate SME. The fix is to clear the C-bit from rsvd_bits_mask array after it has been populated from __reset_rsvds_bits_mask() Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: kvm@vger.kernel.org Cc: paolo.bonzini@gmail.com Fixes: d0ec49d ("kvm/x86/svm: Support Secure Memory Encryption within KVM") Link: http://lkml.kernel.org/r/20170825205540.123531-1-brijesh.singh@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
b9dd21e1 |
|
23-Aug-2017 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: x86: simplify handling of PKRU Move it to struct kvm_arch_vcpu, replacing guest_pkru_valid with a simple comparison against the host value of the register. The write of PKRU in addition can be skipped if the guest has not enabled the feature. Once we do this, we need not test OSPKE in the host anymore, because guest_CR4.PKE=1 implies host_CR4.PKE=1. The static PKU test is kept to elide the code on older CPUs. Suggested-by: Yang Zhang <zy107165@alibaba-inc.com> Fixes: 1be0e61c1f255faaeab04a390e00c8b9b9042870 Cc: stable@vger.kernel.org Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
855feb67 |
|
24-Aug-2017 |
Yu Zhang <yu.c.zhang@linux.intel.com> |
KVM: MMU: Add 5 level EPT & Shadow page table support. Extends the shadow paging code, so that 5 level shadow page table can be constructed if VM is running in 5 level paging mode. Also extends the ept code, so that 5 level ept table can be constructed if maxphysaddr of VM exceeds 48 bits. Unlike the shadow logic, KVM should still use 4 level ept table for a VM whose physical address width is less than 48 bits, even when the VM is running in 5 level paging mode. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> [Unconditionally reset the MMU context in kvm_cpuid_update. Changing MAXPHYADDR invalidates the reserved bit bitmasks. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
2a7266a8 |
|
24-Aug-2017 |
Yu Zhang <yu.c.zhang@linux.intel.com> |
KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL. Now we have 4 level page table and 5 level page table in 64 bits long mode, let's rename the PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL, then we can use PT64_ROOT_5LEVEL for 5 level page table, it's helpful to make the code more clear. Also PT64_ROOT_MAX_LEVEL is defined as 4, so that we can just redefine it to 5 whenever a replacement is needed for 5 level paging. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
d1cd3ce9 |
|
24-Aug-2017 |
Yu Zhang <yu.c.zhang@linux.intel.com> |
KVM: MMU: check guest CR3 reserved bits based on its physical address width. Currently, KVM uses CR3_L_MODE_RESERVED_BITS to check the reserved bits in CR3. Yet the length of reserved bits in guest CR3 should be based on the physical address width exposed to the VM. This patch changes CR3 check logic to calculate the reserved bits at runtime. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
e08d26f0 |
|
17-Aug-2017 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: x86: simplify ept_misconfig Calling handle_mmio_page_fault() has been unnecessary since commit e9ee956e311d ("KVM: x86: MMU: Move handle_mmio_page_fault() call to kvm_mmu_page_fault()", 2016-02-22). handle_mmio_page_fault() can now be made static. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
#
d0ec49d4 |
|
17-Jul-2017 |
Tom Lendacky <thomas.lendacky@amd.com> |
kvm/x86/svm: Support Secure Memory Encryption within KVM Update the KVM support to work with SME. The VMCB has a number of fields where physical addresses are used and these addresses must contain the memory encryption mask in order to properly access the encrypted memory. Also, use the memory encryption mask when creating and using the nested page tables. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Dave Young <dyoung@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Toshimitsu Kani <toshi.kani@hpe.com> Cc: kasan-dev@googlegroups.com Cc: kvm@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-efi@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/89146eccfa50334409801ff20acd52a90fb5efcf.1500319216.git.thomas.lendacky@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
1261bfa3 |
|
13-Jul-2017 |
Wanpeng Li <wanpeng.li@hotmail.com> |
KVM: async_pf: Add L1 guest async_pf #PF vmexit handler This patch adds the L1 guest async page fault #PF vmexit handler, such by L1 similar to ordinary async page fault. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> [Passed insn parameters to kvm_mmu_page_fault().] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
#
dcdca5fe |
|
30-Jun-2017 |
Peter Feiner <pfeiner@google.com> |
x86: kvm: mmu: make spte mmio mask more explicit Specify both a mask (i.e., bits to consider) and a value (i.e., pattern of bits that indicates a special PTE) for mmio SPTEs. On Intel, this lets us pack even more information into the (SPTE_SPECIAL_MASK | EPT_VMX_RWX_MASK) mask we use for access tracking liberating all (SPTE_SPECIAL_MASK | (non-misconfigured-RWX)) values. Signed-off-by: Peter Feiner <pfeiner@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
9bc1f09f |
|
08-Jun-2017 |
Wanpeng Li <wanpeng.li@hotmail.com> |
KVM: async_pf: avoid async pf injection when in guest mode INFO: task gnome-terminal-:1734 blocked for more than 120 seconds. Not tainted 4.12.0-rc4+ #8 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. gnome-terminal- D 0 1734 1015 0x00000000 Call Trace: __schedule+0x3cd/0xb30 schedule+0x40/0x90 kvm_async_pf_task_wait+0x1cc/0x270 ? __vfs_read+0x37/0x150 ? prepare_to_swait+0x22/0x70 do_async_page_fault+0x77/0xb0 ? do_async_page_fault+0x77/0xb0 async_page_fault+0x28/0x30 This is triggered by running both win7 and win2016 on L1 KVM simultaneously, and then gives stress to memory on L1, I can observed this hang on L1 when at least ~70% swap area is occupied on L0. This is due to async pf was injected to L2 which should be injected to L1, L2 guest starts receiving pagefault w/ bogus %cr2(apf token from the host actually), and L1 guest starts accumulating tasks stuck in D state in kvm_async_pf_task_wait() since missing PAGE_READY async_pfs. This patch fixes the hang by doing async pf when executing L1 guest. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
bab4165e |
|
05-May-2017 |
Bandan Das <bsd@redhat.com> |
kvm: x86: Add a hook for arch specific dirty logging emulation When KVM updates accessed/dirty bits, this hook can be used to invoke an arch specific function that implements/emulates dirty logging such as PML. Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
ae1e2d10 |
|
30-Mar-2017 |
Paolo Bonzini <pbonzini@redhat.com> |
kvm: nVMX: support EPT accessed/dirty bits Now use bit 6 of EPTP to optionally enable A/D bits for EPTP. Another thing to change is that, when EPT accessed and dirty bits are not in use, VMX treats accesses to guest paging structures as data reads. When they are in use (bit 6 of EPTP is set), they are treated as writes and the corresponding EPT dirty bit is set. The MMU didn't know this detail, so this patch adds it. We also have to fix up the exit qualification. It may be wrong because KVM sets bit 6 but the guest might not. L1 emulates EPT A/D bits using write permissions, so in principle it may be possible for EPT A/D bits to be used by L1 even though not available in hardware. The problem is that guest page-table walks will be treated as reads rather than writes, so they would not cause an EPT violation. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [Fixed typo in walk_addr_generic() comment and changed bit clear + conditional-set pattern in handle_ept_violation() to conditional-clear] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
#
812f30b2 |
|
12-Jul-2016 |
Bandan Das <bsd@redhat.com> |
kvm: mmu: remove is_present_gpte() We have two versions of the above function. To prevent confusion and bugs in the future, remove the non-FNAME version entirely and replace all calls with the actual check. Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
7a98205d |
|
25-Mar-2016 |
Xiao Guangrong <guangrong.xiao@linux.intel.com> |
KVM: MMU: fix permission_fault() kvm-unit-tests complained about the PFEC is not set properly, e.g,: test pte.rw pte.d pte.nx pde.p pde.rw pde.pse user fetch: FAIL: error code 15 expected 5 Dump mapping: address: 0x123400000000 ------L4: 3e95007 ------L3: 3e96007 ------L2: 2000083 It's caused by the reason that PFEC returned to guest is copied from the PFEC triggered by shadow page table This patch fixes it and makes the logic of updating errcode more clean Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> [Do not assume pfec.p=1. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
be94f6b7 |
|
22-Mar-2016 |
Huaitong Han <huaitong.han@intel.com> |
KVM, pkeys: add pkeys support for permission_fault Protection keys define a new 4-bit protection key field (PKEY) in bits 62:59 of leaf entries of the page tables, the PKEY is an index to PKRU register(16 domains), every domain has 2 bits(write disable bit, access disable bit). Static logic has been produced in update_pkru_bitmask, dynamic logic need read pkey from page table entries, get pkru value, and deduce the correct result. [ Huaitong: Xiao helps to modify many sections. ] Signed-off-by: Huaitong Han <huaitong.han@intel.com> Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
f13577e8 |
|
08-Mar-2016 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: MMU: return page fault error code from permission_fault This will help in the implementation of PKRU, where the PK bit of the page fault error code cannot be computed in advance (unlike I/D, R/W and U/S). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
aeecee2e |
|
24-Feb-2016 |
Xiao Guangrong <guangrong.xiao@linux.intel.com> |
KVM: MMU: introduce kvm_mmu_slot_gfn_write_protect Split rmap_write_protect() and introduce the function to abstract the write protection based on the slot This function will be used in the later patch Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
547ffaed |
|
24-Feb-2016 |
Xiao Guangrong <guangrong.xiao@linux.intel.com> |
KVM: MMU: introduce kvm_mmu_gfn_{allow,disallow}_lpage Abstract the common operations from account_shadowed() and unaccount_shadowed(), then introduce kvm_mmu_gfn_disallow_lpage() and kvm_mmu_gfn_allow_lpage() These two functions will be used by page tracking in the later patch Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
450869d6 |
|
04-Nov-2015 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: x86: merge handle_mmio_page_fault and handle_mmio_page_fault_common They are exactly the same, except that handle_mmio_page_fault has an unused argument and a call to WARN_ON. Remove the unused argument from the callers, and move the warning to (the former) handle_mmio_page_fault_common. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
f735d4af |
|
04-Aug-2015 |
Xiao Guangrong <guangrong.xiao@linux.intel.com> |
KVM: VMX: drop ept misconfig check The logic used to check ept misconfig is completely contained in common reserved bits check for sptes, so it can be removed Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
c258b62b |
|
04-Aug-2015 |
Xiao Guangrong <guangrong.xiao@linux.intel.com> |
KVM: MMU: introduce the framework to check zero bits on sptes We have abstracted the data struct and functions which are used to check reserved bit on guest page tables, now we extend the logic to check zero bits on shadow page tables The zero bits on sptes include not only reserved bits on hardware but also the bits that SPTEs willnever use. For example, shadow pages will never use GB pages unless the guest uses them too. Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
efdfe536 |
|
13-May-2015 |
Xiao Guangrong <guangrong.xiao@linux.intel.com> |
KVM: MMU: fix MTRR update Currently, whenever guest MTRR registers are changed kvm_mmu_reset_context is called to switch to the new root shadow page table, however, it's useless since: 1) the cache type is not cached into shadow page's attribute so that the original root shadow page will be reused 2) the cache type is set on the last spte, that means we should sync the last sptes when MTRR is changed This patch fixs this issue by drop all the spte in the gfn range which is being updated by MTRR Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
8a3d08f1 |
|
13-May-2015 |
Xiao Guangrong <guangrong.xiao@linux.intel.com> |
KVM: MMU: introduce PT_MAX_HUGEPAGE_LEVEL Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
edc90b7d |
|
11-May-2015 |
Xiao Guangrong <guangrong.xiao@linux.intel.com> |
KVM: MMU: fix SMAP virtualization KVM may turn a user page to a kernel page when kernel writes a readonly user page if CR0.WP = 1. This shadow page entry will be reused after SMAP is enabled so that kernel is allowed to access this user page Fix it by setting SMAP && !CR0.WP into shadow page's role and reset mmu once CR4.SMAP is updated Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
0be0226f |
|
11-May-2015 |
Xiao Guangrong <guangrong.xiao@linux.intel.com> |
KVM: MMU: fix SMAP virtualization KVM may turn a user page to a kernel page when kernel writes a readonly user page if CR0.WP = 1. This shadow page entry will be reused after SMAP is enabled so that kernel is allowed to access this user page Fix it by setting SMAP && !CR0.WP into shadow page's role and reset mmu once CR4.SMAP is updated Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
7cbeed9b |
|
07-May-2015 |
Xiao Guangrong <guangrong.xiao@linux.intel.com> |
KVM: MMU: fix smap permission check Current permission check assumes that RSVD bit in PFEC is always zero, however, it is not true since MMIO #PF will use it to quickly identify MMIO access Fix it by clearing the bit if walking guest page table is needed Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
ceee7df7 |
|
07-May-2015 |
Xiao Guangrong <guangrong.xiao@linux.intel.com> |
KVM: MMU: fix smap permission check Current permission check assumes that RSVD bit in PFEC is always zero, however, it is not true since MMIO #PF will use it to quickly identify MMIO access Fix it by clearing the bit if walking guest page table is needed Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
c205fb7d |
|
24-Dec-2014 |
Nadav Amit <namit@cs.technion.ac.il> |
KVM: x86: #PF error-code on R/W operations is wrong When emulating an instruction that reads the destination memory operand (i.e., instructions without the Mov flag in the emulator), the operand is first read. If a page-fault is detected in this phase, the error-code which would be delivered to the VM does not indicate that the access that caused the exception is a write one. This does not conform with real hardware, and may cause the VM to enter the page-fault handler twice for no reason (once for read, once for write). Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
ad896af0 |
|
02-Oct-2013 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: x86: mmu: remove argument to kvm_init_shadow_mmu and kvm_init_shadow_ept_mmu The initialization function in mmu.c can always use walk_mmu, which is known to be vcpu->arch.mmu. Only init_kvm_nested_mmu is used to initialize vcpu->arch.nested_mmu. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
d1431483 |
|
01-Sep-2014 |
Tiejun Chen <tiejun.chen@intel.com> |
KVM: mmio: cleanup kvm_set_mmio_spte_mask Just reuse rsvd_bits() inside kvm_set_mmio_spte_mask() for slightly better code. Signed-off-by: Tiejun Chen <tiejun.chen@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
198c74f4 |
|
17-Apr-2014 |
Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> |
KVM: MMU: flush tlb out of mmu lock when write-protect the sptes Now we can flush all the TLBs out of the mmu lock without TLB corruption when write-proect the sptes, it is because: - we have marked large sptes readonly instead of dropping them that means we just change the spte from writable to readonly so that we only need to care the case of changing spte from present to present (changing the spte from present to nonpresent will flush all the TLBs immediately), in other words, the only case we need to care is mmu_spte_update() - in mmu_spte_update(), we haved checked SPTE_HOST_WRITEABLE | PTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, that means it does not depend on PT_WRITABLE_MASK anymore Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
#
97ec8c06 |
|
01-Apr-2014 |
Feng Wu <feng.wu@intel.com> |
KVM: Add SMAP support when setting CR4 This patch adds SMAP handling logic when setting CR4 for guests Thanks a lot to Paolo Bonzini for his suggestion to use the branchless way to detect SMAP violation. Signed-off-by: Feng Wu <feng.wu@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
#
8a3c1a33 |
|
02-Oct-2013 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: mmu: change useless int return types to void kvm_mmu initialization is mostly filling in function pointers, there is no way for it to fail. Clean up unused return values. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
|
#
155a97a3 |
|
05-Aug-2013 |
Nadav Har'El <nyh@il.ibm.com> |
nEPT: MMU context for nested EPT KVM's existing shadow MMU code already supports nested TDP. To use it, we need to set up a new "MMU context" for nested EPT, and create a few callbacks for it (nested_ept_*()). This context should also use the EPT versions of the page table access functions (defined in the previous patch). Then, we need to switch back and forth between this nested context and the regular MMU context when switching between L1 and L2 (when L1 runs this L2 with EPT). Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
f8f55942 |
|
07-Jun-2013 |
Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> |
KVM: MMU: fast invalidate all mmio sptes This patch tries to introduce a very simple and scale way to invalidate all mmio sptes - it need not walk any shadow pages and hold mmu-lock KVM maintains a global mmio valid generation-number which is stored in kvm->memslots.generation and every mmio spte stores the current global generation-number into his available bits when it is created When KVM need zap all mmio sptes, it just simply increase the global generation-number. When guests do mmio access, KVM intercepts a MMIO #PF then it walks the shadow page table and get the mmio spte. If the generation-number on the spte does not equal the global generation-number, it will go to the normal #PF handler to update the mmio spte Since 19 bits are used to store generation-number on mmio spte, we zap all mmio sptes when the number is round Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
b37fbea6 |
|
07-Jun-2013 |
Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> |
KVM: MMU: make return value of mmio page fault handler more readable Define some meaningful names instead of raw code Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
5304b8d3 |
|
30-May-2013 |
Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> |
KVM: MMU: fast invalidate all pages The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to walk and zap all shadow pages one by one, also it need to zap all guest page's rmap and all shadow page's parent spte list. Particularly, things become worse if guest uses more memory or vcpus. It is not good for scalability In this patch, we introduce a faster way to invalidate all shadow pages. KVM maintains a global mmu invalid generation-number which is stored in kvm->arch.mmu_valid_gen and every shadow page stores the current global generation-number into sp->mmu_valid_gen when it is created When KVM need zap all shadow pages sptes, it just simply increase the global generation-number then reload root shadow pages on all vcpus. Vcpu will create a new shadow page table according to current kvm's generation-number. It ensures the old pages are not used any more. Then the obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) are zapped by using lock-break technique Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
|
#
81f4f76b |
|
21-Mar-2013 |
Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> |
KVM: MMU: Rename kvm_mmu_free_some_pages() to make_mmu_pages_available() The current name "kvm_mmu_free_some_pages" should be used for something that actually frees some shadow pages, as we expect from the name, but what the function is doing is to make some, KVM_MIN_FREE_MMU_PAGES, shadow pages available: it does nothing when there are enough. This patch changes the name to reflect this meaning better; while doing this renaming, the code in the wrapper function is inlined into the main body since the whole function will be inlined into the only caller now. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
#
5d218814 |
|
12-Mar-2013 |
Marcelo Tosatti <mtosatti@redhat.com> |
KVM: MMU: make kvm_mmu_available_pages robust against n_used_mmu_pages > n_max_mmu_pages As noticed by Ulrich Obergfell <uobergfe@redhat.com>, the mmu counters are for beancounting purposes only - so n_used_mmu_pages and n_max_mmu_pages could be relaxed (example: before f0f5933a1626c8df7b), resulting in n_used_mmu_pages > n_max_mmu_pages. Make code robust against n_used_mmu_pages > n_max_mmu_pages. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
|
#
6fd01b71 |
|
12-Sep-2012 |
Avi Kivity <avi@redhat.com> |
KVM: MMU: Optimize is_last_gpte() Instead of branchy code depending on level, gpte.ps, and mmu configuration, prepare everything in a bitmap during mode changes and look it up during runtime. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
97d64b78 |
|
12-Sep-2012 |
Avi Kivity <avi@redhat.com> |
KVM: MMU: Optimize pte permission checks walk_addr_generic() permission checks are a maze of branchy code, which is performed four times per lookup. It depends on the type of access, efer.nxe, cr0.wp, cr4.smep, and in the near future, cr4.smap. Optimize this away by precalculating all variants and storing them in a bitmap. The bitmap is recalculated when rarely-changing variables change (cr0, cr4) and is indexed by the often-changing variables (page fault error code, pte access permissions). The permission check is moved to the end of the loop, otherwise an SMEP fault could be reported as a false positive, when PDE.U=1 but PTE.U=0. Noted by Xiao Guangrong. The result is short, branch-free code. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
8ea667f2 |
|
12-Sep-2012 |
Avi Kivity <avi@redhat.com> |
KVM: MMU: Push clean gpte write protection out of gpte_access() gpte_access() computes the access permissions of a guest pte and also write-protects clean gptes. This is wrong when we are servicing a write fault (since we'll be setting the dirty bit momentarily) but correct when instantiating a speculative spte, or when servicing a read fault (since we'll want to trap a following write in order to set the dirty bit). It doesn't seem to hurt in practice, but in order to make the code readable, push the write protection out of gpte_access() and into a new protect_clean_gpte() which is called explicitly when needed. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
ce88decf |
|
11-Jul-2011 |
Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> |
KVM: MMU: mmio page fault support The idea is from Avi: | We could cache the result of a miss in an spte by using a reserved bit, and | checking the page fault error code (or seeing if we get an ept violation or | ept misconfiguration), so if we get repeated mmio on a page, we don't need to | search the slot list/tree. | (https://lkml.org/lkml/2011/2/22/221) When the page fault is caused by mmio, we cache the info in the shadow page table, and also set the reserved bits in the shadow page table, so if the mmio is caused again, we can quickly identify it and emulate it directly Searching mmio gfn in memslots is heavy since we need to walk all memeslots, it can be reduced by this feature, and also avoid walking guest page table for soft mmu. [jan: fix operator precedence issue] Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
bebb106a |
|
11-Jul-2011 |
Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> |
KVM: MMU: cache mmio info on page fault path If the page fault is caused by mmio, we can cache the mmio info, later, we do not need to walk guest page table and quickly know it is a mmio fault while we emulate the mmio instruction Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
0959ffac |
|
14-Sep-2010 |
Joerg Roedel <joerg.roedel@amd.com> |
KVM: MMU: Don't track nested fault info in error-code This patch moves the detection whether a page-fault was nested or not out of the error code and moves it into a separate variable in the fault struct. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
02f59dc9 |
|
10-Sep-2010 |
Joerg Roedel <joerg.roedel@amd.com> |
KVM: MMU: Introduce init_kvm_nested_mmu() This patch introduces the init_kvm_nested_mmu() function which is used to re-initialize the nested mmu when the l2 guest changes its paging mode. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
52fde8df |
|
10-Sep-2010 |
Joerg Roedel <joerg.roedel@amd.com> |
KVM: MMU: Introduce kvm_init_shadow_mmu helper function Some logic of the init_kvm_softmmu function is required to build the Nested Nested Paging context. So factor the required logic into a seperate function and export it. Also make the whole init path suitable for more than one mmu context. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
49d5ca26 |
|
19-Aug-2010 |
Dave Hansen <dave@linux.vnet.ibm.com> |
KVM: replace x86 kvm n_free_mmu_pages with n_used_mmu_pages Doing this makes the code much more readable. That's borne out by the fact that this patch removes code. "used" also happens to be the number that we need to return back to the slab code when our shrinker gets called. Keeping this value as opposed to free makes the next patch simpler. So, 'struct kvm' is kzalloc()'d. 'struct kvm_arch' is a structure member (and not a pointer) of 'struct kvm'. That means they start out zeroed. I _think_ they get initialized properly by kvm_mmu_change_mmu_pages(). But, that only happens via kvm ioctls. Another benefit of storing 'used' intead of 'free' is that the values are consistent from the moment the structure is allocated: no negative "used" value. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Tim Pepper <lnxninja@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
e0df7b9f |
|
19-Aug-2010 |
Dave Hansen <dave@linux.vnet.ibm.com> |
KVM: abstract kvm x86 mmu->n_free_mmu_pages "free" is a poor name for this value. In this context, it means, "the number of mmu pages which this kvm instance should be able to allocate." But "free" implies much more that the objects are there and ready for use. "available" is a much better description, especially when you see how it is calculated. In this patch, we abstract its use into a function. We'll soon replace the function's contents by calculating the value in a different way. All of the reads of n_free_mmu_pages are taken care of in this patch. The modification sites will be handled in a patch later in the series. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Tim Pepper <lnxninja@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
1871c602 |
|
10-Feb-2010 |
Gleb Natapov <gleb@redhat.com> |
KVM: x86 emulator: fix memory access during x86 emulation Currently when x86 emulator needs to access memory, page walk is done with broadest permission possible, so if emulated instruction was executed by userspace process it can still access kernel memory. Fix that by providing correct memory access to page walker during emulation. Signed-off-by: Gleb Natapov <gleb@redhat.com> Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
836a1b3c |
|
21-Jan-2010 |
Avi Kivity <avi@redhat.com> |
KVM: Move cr0/cr4/efer related helpers to x86.h They have more general scope than the mmu. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
#
4d4ec087 |
|
29-Dec-2009 |
Avi Kivity <avi@redhat.com> |
KVM: Replace read accesses of vcpu->arch.cr0 by an accessor Since we'd like to allow the guest to own a few bits of cr0 at times, we need to know when we access those bits. Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
c9c54174 |
|
05-Jan-2010 |
Sheng Yang <sheng@linux.intel.com> |
KVM: x86: Moving PT_*_LEVEL to mmu.h We can use them in x86.c and vmx.c now... Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
#
fc78f519 |
|
06-Dec-2009 |
Avi Kivity <avi@redhat.com> |
KVM: Add accessor for reading cr4 (or some bits of cr4) Some bits of cr4 can be owned by the guest on vmx, so when we read them, we copy them to the vcpu structure. In preparation for making the set of guest-owned bits dynamic, use helpers to access these bits so we don't need to know where the bit resides. No changes to svm since all bits are host-owned there. Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
94d8b056 |
|
10-Jun-2009 |
Marcelo Tosatti <mtosatti@redhat.com> |
KVM: MMU: add kvm_mmu_get_spte_hierarchy helper Required by EPT misconfiguration handler. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
43a3795a |
|
10-Jun-2009 |
Avi Kivity <avi@redhat.com> |
KVM: MMU: Adjust pte accessors to explicitly indicate guest or shadow pte Since the guest and host ptes can have wildly different format, adjust the pte accessor names to indicate on which type of pte they operate on. No functional changes. Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
20c466b5 |
|
31-Mar-2009 |
Dong, Eddie <eddie.dong@intel.com> |
KVM: Use rsvd_bits_mask in load_pdptrs() Also remove bit 5-6 from rsvd_bits_mask per latest SDM. Signed-off-by: Eddie Dong <Eddie.Dong@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
|
#
41d6af11 |
|
28-Feb-2008 |
Amit Shah <amit.shah@qumranet.com> |
KVM: is_long_mode() should check for EFER.LMA is_long_mode currently checks the LongModeEnable bit in EFER instead of the LongModeActive bit. This is wrong, but we survived this till now since it wasn't triggered. This breaks guests that go from long mode to compatibility mode. This is noticed on a solaris guest and fixes bug #1842160 Signed-off-by: Amit Shah <amit.shah@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
|
#
1b7fcd32 |
|
15-May-2008 |
Avi Kivity <avi@qumranet.com> |
KVM: MMU: Fix false flooding when a pte points to page table The KVM MMU tries to detect when a speculative pte update is not actually used by demand fault, by checking the accessed bit of the shadow pte. If the shadow pte has not been accessed, we deem that page table flooded and remove the shadow page table, allowing further pte updates to proceed without emulation. However, if the pte itself points at a page table and only used for write operations, the accessed bit will never be set since all access will happen through the emulator. This is exactly what happens with kscand on old (2.4.x) HIGHMEM kernels. The kernel points a kmap_atomic() pte at a page table, and then proceeds with read-modify-write operations to look at the dirty and accessed bits. We get a false flood trigger on the kmap ptes, which results in the mmu spending all its time setting up and tearing down shadows. Fix by setting the shadow accessed bit on emulated accesses. Signed-off-by: Avi Kivity <avi@qumranet.com>
|
#
67253af5 |
|
24-Apr-2008 |
Sheng Yang <sheng.yang@intel.com> |
KVM: Add kvm_x86_ops get_tdp_level() The function get_tdp_level() provided the number of tdp level for EPT and NPT rather than the NPT specific macro. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
|
#
8c6d6adc |
|
24-Apr-2008 |
Sheng Yang <sheng.yang@intel.com> |
KVM: MMU: Move some definitions to a header file Move some definitions to mmu.h in order to allow building common table entries between EPT and non-EPT. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
|
#
fb72d167 |
|
07-Feb-2008 |
Joerg Roedel <joerg.roedel@amd.com> |
KVM: MMU: add TDP support to the KVM MMU This patch contains the changes to the KVM MMU necessary for support of the Nested Paging feature in AMD Barcelona and Phenom Processors. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
|
#
edf88417 |
|
16-Dec-2007 |
Avi Kivity <avi@qumranet.com> |
KVM: Move arch dependent files to new directory arch/x86/kvm/ This paves the way for multiple architecture support. Note that while ioapic.c could potentially be shared with ia64, it is also moved. Signed-off-by: Avi Kivity <avi@qumranet.com>
|