History log of /linux-master/arch/x86/kvm/x86.h
Revision Date Author Comments
# 451a7078 27-Feb-2024 David Woodhouse <dwmw@amazon.co.uk>

KVM: x86/xen: improve accuracy of Xen timers

A test program such as http://david.woodhou.se/timerlat.c confirms user
reports that timers are increasingly inaccurate as the lifetime of a
guest increases. Reporting the actual delay observed when asking for
100µs of sleep, it starts off OK on a newly-launched guest but gets
worse over time, giving incorrect sleep times:

root@ip-10-0-193-21:~# ./timerlat -c -n 5
00000000 latency 103243/100000 (3.2430%)
00000001 latency 103243/100000 (3.2430%)
00000002 latency 103242/100000 (3.2420%)
00000003 latency 103245/100000 (3.2450%)
00000004 latency 103245/100000 (3.2450%)

The biggest problem is that get_kvmclock_ns() returns inaccurate values
when the guest TSC is scaled. The guest sees a TSC value scaled from the
host TSC by a mul/shift conversion (hopefully done in hardware). The
guest then converts that guest TSC value into nanoseconds using the
mul/shift conversion given to it by the KVM pvclock information.

But get_kvmclock_ns() performs only a single conversion directly from
host TSC to nanoseconds, giving a different result. A test program at
http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
over a day.

It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
that. The actual guest hv_clock is per-CPU, and *theoretically* each
vCPU could be running at a *different* frequency. But this patch is
needed anyway because...

The other issue with Xen timers was that the code would snapshot the
host CLOCK_MONOTONIC at some point in time, and then... after a few
interrupts may have occurred, some preemption perhaps... would also read
the guest's kvmclock. Then it would proceed under the false assumption
that those two happened at the *same* time. Any time which *actually*
elapsed between reading the two clocks was introduced as inaccuracies
in the time at which the timer fired.

Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
host TSC just *once*, then use the returned TSC value to calculate the
kvmclock (making sure to do that the way the guest would instead of
making the same mistake get_kvmclock_ns() does).

Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
timers still have to use CLOCK_MONOTONIC. In practice the difference
between the two won't matter over the timescales involved, as the
*absolute* values don't matter; just the delta.

This does mean a new variant of kvm_get_time_and_clockread() is needed;
called kvm_get_monotonic_and_clockread() because that's what it does.

Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20240227115648.3104-2-dwmw2@infradead.org
[sean: massage moved comment, tweak if statement formatting]
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 812d4323 05-Dec-2023 Like Xu <likexu@tencent.com>

KVM: x86/pmu: Explicitly check NMI from guest to reducee false positives

Explicitly check that the source of external interrupt is indeed an NMI
in kvm_arch_pmi_in_guest(), which reduces perf-kvm false positive samples
(host samples labelled as guest samples) generated by perf/core NMI mode
if an NMI arrives after VM-Exit, but before kvm_after_interrupt():

# test: perf-record + cpu-cycles:HP (which collects host-only precise samples)
# Symbol Overhead sys usr guest sys guest usr
# ....................................... ........ ........ ........ ......... .........
#
# Before:
[g] entry_SYSCALL_64 24.63% 0.00% 0.00% 24.63% 0.00%
[g] syscall_return_via_sysret 23.23% 0.00% 0.00% 23.23% 0.00%
[g] files_lookup_fd_raw 6.35% 0.00% 0.00% 6.35% 0.00%
# After:
[k] perf_adjust_freq_unthr_context 57.23% 57.23% 0.00% 0.00% 0.00%
[k] __vmx_vcpu_run 4.09% 4.09% 0.00% 0.00% 0.00%
[k] vmx_update_host_rsp 3.17% 3.17% 0.00% 0.00% 0.00%

In the above case, perf records the samples labelled '[g]', the RIPs behind
the weird samples are actually being queried by perf_instruction_pointer()
after determining whether it's in GUEST state or not, and here's the issue:

If VM-Exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and
at least one PMU counter is enabled on host, the kvm_arch_pmi_in_guest()
will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt().

During this window, if a PMI occurs on host (since the KVM instructions on
host are being executed), the control flow, with the help of the host NMI
context, will be transferred to perf/core to generate performance samples,
thus perf_instruction_pointer() and perf_guest_get_ip() is called.

Since kvm_arch_pmi_in_guest() only checks if there is an interrupt, it may
cause perf/core to mistakenly assume that the source RIP of the host NMI
belongs to the guest world and use perf_guest_get_ip() to get the RIP of
a vCPU that has already exited by a non-NMI interrupt.

Error samples are recorded and presented to the end-user via perf-report.
Such false positive samples could be eliminated by explicitly determining
if the exit reason is KVM_HANDLING_NMI.

Note that when VM-exit is indeed triggered by PMI and before HANDLING_NMI
is cleared, it's also still possible that another PMI is generated on host.
Also for perf/core timer mode, the false positives are still possible since
those non-NMI sources of interrupts are not always being used by perf/core.

For events that are host-only, perf/core can and should eliminate false
positives by checking event->attr.exclude_guest, i.e. events that are
configured to exclude KVM guests should never fire in the guest.

Events that are configured to count host and guest are trickier, perhaps
impossible to handle with 100% accuracy? And regardless of what accuracy
is provided by perf/core, improving KVM's accuracy is cheap and easy, with
no real downsides.

Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting")
Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231206032054.55070-1-likexu@tencent.com
[sean: massage changelog, squash !!in_nmi() fixup from Like]
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 93d1c9f4 13-Sep-2023 Robert Hoo <robert.hu@linux.intel.com>

KVM: x86: Virtualize LAM for supervisor pointer

Add support to allow guests to set the new CR4 control bit for LAM and add
implementation to get untagged address for supervisor pointers.

LAM modifies the canonicality check applied to 64-bit linear addresses for
data accesses, allowing software to use of the untranslated address bits for
metadata and masks the metadata bits before using them as linear addresses
to access memory. LAM uses CR4.LAM_SUP (bit 28) to configure and enable LAM
for supervisor pointers. It also changes VMENTER to allow the bit to be set
in VMCS's HOST_CR4 and GUEST_CR4 to support virtualization. Note CR4.LAM_SUP
is allowed to be set even not in 64-bit mode, but it will not take effect
since LAM only applies to 64-bit linear addresses.

Move CR4.LAM_SUP out of CR4_RESERVED_BITS, its reservation depends on vcpu
supporting LAM or not. Leave it intercepted to prevent guest from setting
the bit if LAM is not exposed to guest as well as to avoid vmread every time
when KVM fetches its value, with the expectation that guest won't toggle the
bit frequently.

Set CR4.LAM_SUP bit in the emulated IA32_VMX_CR4_FIXED1 MSR for guests to
allow guests to enable LAM for supervisor pointers in nested VMX operation.

Hardware is not required to do TLB flush when CR4.LAM_SUP toggled, KVM
doesn't need to emulate TLB flush based on it. There's no other features
or vmx_exec_controls connection, and no other code needed in
{kvm,vmx}_set_cr4().

Skip address untag for instruction fetches (which includes branch targets),
operand of INVLPG instructions, and implicit system accesses, all of which
are not subject to untagging. Note, get_untagged_addr() isn't invoked for
implicit system accesses as there is no reason to do so, but check the
flag anyways for documentation purposes.

Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-11-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 5d6d6a7d 05-Oct-2023 David Woodhouse <dwmw@amazon.co.uk>

KVM: x86: Refine calculation of guest wall clock to use a single TSC read

When populating the guest's PV wall clock information, KVM currently does
a simple 'kvm_get_real_ns() - get_kvmclock_ns(kvm)'. This is an antipattern
which should be avoided; when working with the relationship between two
clocks, it's never correct to obtain one of them "now" and then the other
at a slightly different "now" after an unspecified period of preemption
(which might not even be under the control of the kernel, if this is an
L1 hosting an L2 guest under nested virtualization).

Add a kvm_get_wall_clock_epoch() function to return the guest wall clock
epoch in nanoseconds using the same method as __get_kvmclock() — by using
kvm_get_walltime_and_clockread() to calculate both the wall clock and KVM
clock time from a *single* TSC reading.

The condition using get_cpu_tsc_khz() is equivalent to the version in
__get_kvmclock() which separately checks for the CONSTANT_TSC feature or
the per-CPU cpu_tsc_khz. Which is what get_cpu_tsc_khz() does anyway.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/bfc6d3d7cfb88c47481eabbf5a30a264c58c7789.camel@infradead.org
Signed-off-by: Sean Christopherson <seanjc@google.com>


# a2fd5d02 06-Jun-2023 Sean Christopherson <seanjc@google.com>

KVM: x86: Snapshot host's MSR_IA32_ARCH_CAPABILITIES

Snapshot the host's MSR_IA32_ARCH_CAPABILITIES, if it's supported, instead
of reading the MSR every time KVM wants to query the host state, e.g. when
initializing the default value during vCPU creation. The paths that query
ARCH_CAPABILITIES aren't particularly performance sensitive, but creating
vCPUs is a frequent enough operation that burning 8 bytes is a good
trade-off.

Alternatively, KVM could add a field in kvm_caps and thus skip the
on-demand calculations entirely, but a pure snapshot isn't possible due to
the way KVM handles the l1tf_vmx_mitigation module param. And unlike the
other "supported" fields in kvm_caps, KVM doesn't enforce the "supported"
value, i.e. KVM treats ARCH_CAPABILITIES like a CPUID leaf and lets
userspace advertise whatever it wants. Those problems are solvable, but
it's not clear there is real benefit versus snapshotting the host value,
and grabbing the host value will allow additional cleanup of KVM's
FB_CLEAR_CTRL code.

Link: https://lore.kernel.org/all/20230524061634.54141-2-chao.gao@intel.com
Cc: Chao Gao <chao.gao@intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230607004311.1420507-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 3a5f4907 11-May-2023 Sean Christopherson <seanjc@google.com>

KVM: x86: Make kvm_mtrr_valid() static now that there are no external users

Make kvm_mtrr_valid() local to mtrr.c now that it's not used to check the
validity of a PAT MSR value.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230511233351.635053-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 55cd57b5 04-Apr-2023 Sean Christopherson <seanjc@google.com>

KVM: x86: Filter out XTILE_CFG if XTILE_DATA isn't permitted

Filter out XTILE_CFG from the supported XCR0 reported to userspace if the
current process doesn't have access to XTILE_DATA. Attempting to set
XTILE_CFG in XCR0 will #GP if XTILE_DATA is also not set, and so keeping
XTILE_CFG as supported results in explosions if userspace feeds
KVM_GET_SUPPORTED_CPUID back into KVM and the guest doesn't sanity check
CPUID.

Fixes: 445ecdf79be0 ("kvm: x86: Exclude unpermitted xfeatures at KVM_GET_SUPPORTED_CPUID")
Reported-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Aaron Lewis <aaronlewis@google.com>
Tested-by: Aaron Lewis <aaronlewis@google.com>
Link: https://lore.kernel.org/r/20230405004520.421768-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 6be3ae45 04-Apr-2023 Aaron Lewis <aaronlewis@google.com>

KVM: x86: Add a helper to handle filtering of unpermitted XCR0 features

Add a helper, kvm_get_filtered_xcr0(), to dedup code that needs to account
for XCR0 features that require explicit opt-in on a per-process basis. In
addition to documenting when KVM should/shouldn't consult
xstate_get_guest_group_perm(), the helper will also allow sanitizing the
filtered XCR0 to avoid enumerating architecturally illegal XCR0 values,
e.g. XTILE_CFG without XTILE_DATA.

No functional changes intended.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
[sean: rename helper, move to x86.h, massage changelog]
Reviewed-by: Aaron Lewis <aaronlewis@google.com>
Tested-by: Aaron Lewis <aaronlewis@google.com>
Link: https://lore.kernel.org/r/20230405004520.421768-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 5757f5b9 10-Mar-2023 Sean Christopherson <seanjc@google.com>

KVM: x86: Add macros to track first...last VMX feature MSRs

Add macros to track the range of VMX feature MSRs that are emulated by
KVM to reduce the maintenance cost of extending the set of emulated MSRs.

Note, KVM doesn't necessarily emulate all known/consumed VMX MSRs, e.g.
PROCBASED_CTLS3 is consumed by KVM to enable IPI virtualization, but is
not emulated as KVM doesn't emulate/virtualize IPI virtualization for
nested guests.

No functional change intended.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230311004618.920745-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# fb3146b4 10-Mar-2023 Sean Christopherson <seanjc@google.com>

KVM: x86: Add a helper to query whether or not a vCPU has ever run

Add a helper to query if a vCPU has run so that KVM doesn't have to open
code the check on last_vmentry_cpu being set to a magic value.

No functional change intended.

Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230311004618.920745-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 68f7c82a 21-Mar-2023 Binbin Wu <binbin.wu@linux.intel.com>

KVM: x86: Change return type of is_long_mode() to bool

Change return type of is_long_mode() to bool to avoid implicit cast,
as literally every user of is_long_mode() treats its return value as a
boolean.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20230322045824.22970-5-binbin.wu@linux.intel.com
Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>


# bede6eb4 21-Mar-2023 Binbin Wu <binbin.wu@linux.intel.com>

KVM: x86: Use boolean return value for is_{pae,pse,paging}()

Convert is_{pae,pse,paging}() to use kvm_is_cr{0,4}_bit_set() and return
bools. Returning an "int" requires not one, but two implicit casts, first
from "unsigned long" to "int", and then again to a "bool". Both casts are
more than a bit dangerous; the ulong=>int casts would drop a bit on 64-bit
kernels _if_ the bits in question weren't in the lower 32 bits, and the
int=>bool cast can result in false negatives/positives, e.g. see commit
0c928ff26bd6 ("KVM: SVM: Fix benign "bool vs. int" comparison in
svm_set_cr0()").

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20230322045824.22970-3-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 607475cf 21-Mar-2023 Binbin Wu <binbin.wu@linux.intel.com>

KVM: x86: Add helpers to query individual CR0/CR4 bits

Add helpers to check if a specific CR0/CR4 bit is set to avoid a plethora
of implicit casts from the "unsigned long" return of kvm_read_cr*_bits(),
and to make each caller's intent more obvious.

Defer converting helpers that do truly ugly casts from "unsigned long" to
"int", e.g. is_pse(), to a future commit so that their conversion is more
isolated.

Opportunistically drop the superfluous pcid_enabled from kvm_set_cr3();
the local variable is used only once, immediately after its declaration.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20230322045824.22970-2-binbin.wu@linux.intel.com
[sean: move "obvious" conversions to this commit, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>


# e76ae527 24-Jan-2023 Sean Christopherson <seanjc@google.com>

KVM: x86/pmu: Gate all "unimplemented MSR" prints on report_ignored_msrs

Add helpers to print unimplemented MSR accesses and condition all such
prints on report_ignored_msrs, i.e. honor userspace's request to not
print unimplemented MSRs. Even though vcpu_unimpl() is ratelimited,
printing can still be problematic, e.g. if a print gets stalled when host
userspace is writing MSRs during live migration, an effective stall can
result in very noticeable disruption in the guest.

E.g. the profile below was taken while calling KVM_SET_MSRS on the PMU
counters while the PMU was disabled in KVM.

- 99.75% 0.00% [.] __ioctl
- __ioctl
- 99.74% entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_ioctl
- do_vfs_ioctl
- 92.48% kvm_vcpu_ioctl
- kvm_arch_vcpu_ioctl
- 85.12% kvm_set_msr_ignored_check
svm_set_msr
kvm_set_msr_common
printk
vprintk_func
vprintk_default
vprintk_emit
console_unlock
call_console_drivers
univ8250_console_write
serial8250_console_write
uart_console_write

Reported-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230124234905.3774678-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# 11df586d 12-Dec-2022 Sean Christopherson <seanjc@google.com>

KVM: VMX: Handle NMI VM-Exits in noinstr region

Move VMX's handling of NMI VM-Exits into vmx_vcpu_enter_exit() so that
the NMI is handled prior to leaving the safety of noinstr. Handling the
NMI after leaving noinstr exposes the kernel to potential ordering
problems as an instrumentation-induced fault, e.g. #DB, #BP, #PF, etc.
will unblock NMIs when IRETing back to the faulting instruction.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>


# bec46859 05-Oct-2022 Sean Christopherson <seanjc@google.com>

KVM: x86: Track supported PERF_CAPABILITIES in kvm_caps

Track KVM's supported PERF_CAPABILITIES in kvm_caps instead of computing
the supported capabilities on the fly every time. Using kvm_caps will
also allow for future cleanups as the kvm_caps values can be used
directly in common x86 code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Like Xu <likexu@tencent.com>
Message-Id: <20221006000314.73240-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 1b7a1b78 20-Sep-2022 Sean Christopherson <seanjc@google.com>

KVM: x86: Rename and expose helper to detect if INIT/SIPI are allowed

Rename and invert kvm_vcpu_latch_init() to kvm_apic_init_sipi_allowed()
so as to match the behavior of {interrupt,nmi,smi}_allowed(), and expose
the helper so that it can be used by kvm_vcpu_has_events() to determine
whether or not an INIT or SIPI is pending _and_ can be taken immediately.

Opportunistically replaced usage of the "latch" terminology with "blocked"
and/or "allowed", again to align with KVM's terminology used for all other
event types.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220921003201.1441511-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 7055fb11 30-Aug-2022 Sean Christopherson <seanjc@google.com>

KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions

Treat pending TRIPLE_FAULTS as pending exceptions. A triple fault is an
exception for all intents and purposes, it's just not tracked as such
because there's no vector associated the exception. E.g. if userspace
were to set vcpu->request_interrupt_window while running L2 and L2 hit a
triple fault, a triple fault nested VM-Exit should be synthesized to L1
before exiting to userspace with KVM_EXIT_IRQ_WINDOW_OPEN.

Link: https://lore.kernel.org/all/YoVHAIGcFgJit1qp@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-23-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 7709aba8 30-Aug-2022 Sean Christopherson <seanjc@google.com>

KVM: x86: Morph pending exceptions to pending VM-Exits at queue time

Morph pending exceptions to pending VM-Exits (due to interception) when
the exception is queued instead of waiting until nested events are
checked at VM-Entry. This fixes a longstanding bug where KVM fails to
handle an exception that occurs during delivery of a previous exception,
KVM (L0) and L1 both want to intercept the exception (e.g. #PF for shadow
paging), and KVM determines that the exception is in the guest's domain,
i.e. queues the new exception for L2. Deferring the interception check
causes KVM to esclate various combinations of injected+pending exceptions
to double fault (#DF) without consulting L1's interception desires, and
ends up injecting a spurious #DF into L2.

KVM has fudged around the issue for #PF by special casing emulated #PF
injection for shadow paging, but the underlying issue is not unique to
shadow paging in L0, e.g. if KVM is intercepting #PF because the guest
has a smaller maxphyaddr and L1 (but not L0) is using shadow paging.
Other exceptions are affected as well, e.g. if KVM is intercepting #GP
for one of SVM's workaround or for the VMware backdoor emulation stuff.
The other cases have gone unnoticed because the #DF is spurious if and
only if L1 resolves the exception, e.g. KVM's goofs go unnoticed if L1
would have injected #DF anyways.

The hack-a-fix has also led to ugly code, e.g. bailing from the emulator
if #PF injection forced a nested VM-Exit and the emulator finds itself
back in L1. Allowing for direct-to-VM-Exit queueing also neatly solves
the async #PF in L2 mess; no need to set a magic flag and token, simply
queue a #PF nested VM-Exit.

Deal with event migration by flagging that a pending exception was queued
by userspace and check for interception at the next KVM_RUN, e.g. so that
KVM does the right thing regardless of the order in which userspace
restores nested state vs. event state.

When "getting" events from userspace, simply drop any pending excpetion
that is destined to be intercepted if there is also an injected exception
to be migrated. Ideally, KVM would migrate both events, but that would
require new ABI, and practically speaking losing the event is unlikely to
be noticed, let alone fatal. The injected exception is captured, RIP
still points at the original faulting instruction, etc... So either the
injection on the target will trigger the same intercepted exception, or
the source of the intercepted exception was transient and/or
non-deterministic, thus dropping it is ok-ish.

Fixes: a04aead144fd ("KVM: nSVM: fix running nested guests when npt=0")
Fixes: feaf0c7dc473 ("KVM: nVMX: Do not generate #DF if #PF happens during exception delivery into L2")
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-22-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# d4963e31 30-Aug-2022 Sean Christopherson <seanjc@google.com>

KVM: x86: Make kvm_queued_exception a properly named, visible struct

Move the definition of "struct kvm_queued_exception" out of kvm_vcpu_arch
in anticipation of adding a second instance in kvm_vcpu_arch to handle
exceptions that occur when vectoring an injected exception and are
morphed to VM-Exit instead of leading to #DF.

Opportunistically take advantage of the churn to rename "nr" to "vector".

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-15-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# c33f6f222 07-Jun-2022 Sean Christopherson <seanjc@google.com>

KVM: x86: Split kvm_is_valid_cr4() and export only the non-vendor bits

Split the common x86 parts of kvm_is_valid_cr4(), i.e. the reserved bits
checks, into a separate helper, __kvm_is_valid_cr4(), and export only the
inner helper to vendor code in order to prevent nested VMX from calling
back into vmx_is_valid_cr4() via kvm_is_valid_cr4().

On SVM, this is a nop as SVM doesn't place any additional restrictions on
CR4.

On VMX, this is also currently a nop, but only because nested VMX is
missing checks on reserved CR4 bits for nested VM-Enter. That bug will
be fixed in a future patch, and could simply use kvm_is_valid_cr4() as-is,
but nVMX has _another_ bug where VMXON emulation doesn't enforce VMX's
restrictions on CR0/CR4. The cleanest and most intuitive way to fix the
VMXON bug is to use nested_host_cr{0,4}_valid(). If the CR4 variant
routes through kvm_is_valid_cr4(), using nested_host_cr4_valid() won't do
the right thing for the VMXON case as vmx_is_valid_cr4() enforces VMX's
restrictions if and only if the vCPU is post-VMXON.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 2f4073e0 24-May-2022 Tao Xu <tao3.xu@intel.com>

KVM: VMX: Enable Notify VM exit

There are cases that malicious virtual machines can cause CPU stuck (due
to event windows don't open up), e.g., infinite loop in microcode when
nested #AC (CVE-2015-5307). No event window means no event (NMI, SMI and
IRQ) can be delivered. It leads the CPU to be unavailable to host or
other VMs.

VMM can enable notify VM exit that a VM exit generated if no event
window occurs in VM non-root mode for a specified amount of time (notify
window).

Feature enabling:
- The new vmcs field SECONDARY_EXEC_NOTIFY_VM_EXITING is introduced to
enable this feature. VMM can set NOTIFY_WINDOW vmcs field to adjust
the expected notify window.
- Add a new KVM capability KVM_CAP_X86_NOTIFY_VMEXIT so that user space
can query and enable this feature in per-VM scope. The argument is a
64bit value: bits 63:32 are used for notify window, and bits 31:0 are
for flags. Current supported flags:
- KVM_X86_NOTIFY_VMEXIT_ENABLED: enable the feature with the notify
window provided.
- KVM_X86_NOTIFY_VMEXIT_USER: exit to userspace once the exits happen.
- It's safe to even set notify window to zero since an internal hardware
threshold is added to vmcs.notify_window.

VM exit handling:
- Introduce a vcpu state notify_window_exits to records the count of
notify VM exits and expose it through the debugfs.
- Notify VM exit can happen incident to delivery of a vector event.
Allow it in KVM.
- Exit to userspace unconditionally for handling when VM_CONTEXT_INVALID
bit is set.

Nested handling
- Nested notify VM exits are not supported yet. Keep the same notify
window control in vmcs02 as vmcs01, so that L1 can't escape the
restriction of notify VM exits through launching L2 VM.

Notify VM exit is defined in latest Intel Architecture Instruction Set
Extensions Programming Reference, chapter 9.2.

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
Co-developed-by: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20220524135624.22988-5-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 938c8745 24-May-2022 Sean Christopherson <seanjc@google.com>

KVM: x86: Introduce "struct kvm_caps" to track misc caps/settings

Add kvm_caps to hold a variety of capabilites and defaults that aren't
handled by kvm_cpu_caps because they aren't CPUID bits in order to reduce
the amount of boilerplate code required to add a new feature. The vast
majority (all?) of the caps interact with vendor code and are written
only during initialization, i.e. should be tagged __read_mostly, declared
extern in x86.h, and exported.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220524135624.22988-4-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# cb00a70b 19-Jan-2022 David Matlack <dmatlack@google.com>

KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG

When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not
write-protected when dirty logging is enabled on the memslot. Instead
they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for
the first time and only for the specific sub-region being cleared.

Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to
write-protecting to avoid causing write-protection faults on vCPU
threads. This also allows userspace to smear the cost of huge page
splitting across multiple ioctls, rather than splitting the entire
memslot as is the case when initially-all-set is not used.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-17-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 1fb85d06 31-Jan-2022 Adrian Hunter <adrian.hunter@intel.com>

x86: Share definition of __is_canonical_address()

Reduce code duplication by moving canonical address code to a common header
file.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220131072453.2839535-3-adrian.hunter@intel.com


# b2d2af7e 01-Feb-2022 Mark Rutland <mark.rutland@arm.com>

kvm/x86: rework guest entry logic

For consistency and clarity, migrate x86 over to the generic helpers for
guest timing and lockdep/RCU/tracing management, and remove the
x86-specific helpers.

Prior to this patch, the guest timing was entered in
kvm_guest_enter_irqoff() (called by svm_vcpu_enter_exit() and
svm_vcpu_enter_exit()), and was exited by the call to
vtime_account_guest_exit() within vcpu_enter_guest().

To minimize duplication and to more clearly balance entry and exit, both
entry and exit of guest timing are placed in vcpu_enter_guest(), using
the new guest_timing_{enter,exit}_irqoff() helpers. When context
tracking is used a small amount of additional time will be accounted
towards guests; tick-based accounting is unnaffected as IRQs are
disabled at this point and not enabled until after the return from the
guest.

This also corrects (benign) mis-balanced context tracking accounting
introduced in commits:

ae95f566b3d22ade ("KVM: X86: TSCDEADLINE MSR emulation fastpath")
26efe2fd92e50822 ("KVM: VMX: Handle preemption timer fastpath")

Where KVM can enter a guest multiple times, calling vtime_guest_enter()
without a corresponding call to vtime_account_guest_exit(), and with
vtime_account_system() called when vtime_account_guest() should be used.
As account_system_time() checks PF_VCPU and calls account_guest_time(),
this doesn't result in any functional problem, but is unnecessarily
confusing.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <20220201132926.3301912-4-mark.rutland@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 4732f244 11-Jan-2022 Like Xu <likexu@tencent.com>

KVM: x86: Making the module parameter of vPMU more common

The new module parameter to control PMU virtualization should apply
to Intel as well as AMD, for situations where userspace is not trusted.
If the module parameter allows PMU virtualization, there could be a
new KVM_CAP or guest CPUID bits whereby userspace can enable/disable
PMU virtualization on a per-VM basis.

If the module parameter does not allow PMU virtualization, there
should be no userspace override, since we have no precedent for
authorizing that kind of override. If it's false, other counter-based
profiling features (such as LBR including the associated CPUID bits
if any) will not be exposed.

Change its name from "pmu" to "enable_pmu" as we have temporary
variables with the same name in our code like "struct kvm_pmu *pmu".

Fixes: b1d66dad65dc ("KVM: x86/svm: Add module param to control PMU virtualization")
Suggested-by : Jim Mattson <jmattson@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220111073823.21885-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 55749769 10-Dec-2021 David Woodhouse <dwmw@amazon.co.uk>

KVM: x86: Fix wall clock writes in Xen shared_info not to mark page dirty

When dirty ring logging is enabled, any dirty logging without an active
vCPU context will cause a kernel oops. But we've already declared that
the shared_info page doesn't get dirty tracking anyway, since it would
be kind of insane to mark it dirty every time we deliver an event channel
interrupt. Userspace is supposed to just assume it's always dirty any
time a vCPU can run or event channels are routed.

So stop using the generic kvm_write_wall_clock() and just write directly
through the gfn_to_pfn_cache that we already have set up.

We can make kvm_write_wall_clock() static in x86.c again now, but let's
not remove the 'sec_hi_ofs' argument even though it's not used yet. At
some point we *will* want to use that for KVM guests too.

Fixes: 629b5348841a ("KVM: x86/xen: update wallclock region")
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211210163625.2886-6-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# db215756 10-Nov-2021 Sean Christopherson <seanjc@google.com>

KVM: x86: More precisely identify NMI from guest when handling PMI

Differentiate between IRQ and NMI for KVM's PMC overflow callback, which
was originally invoked in response to an NMI that arrived while the guest
was running, but was inadvertantly changed to fire on IRQs as well when
support for perf without PMU/NMI was added to KVM. In practice, this
should be a nop as the PMC overflow callback shouldn't be reached, but
it's a cheap and easy fix that also better documents the situation.

Note, this also doesn't completely prevent false positives if perf
somehow ends up calling into KVM, e.g. an NMI can arrive in host after
KVM sets its flag.

Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20211111020738.2512932-12-seanjc@google.com


# 73cd107b 10-Nov-2021 Sean Christopherson <seanjc@google.com>

KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu variable

Use the generic kvm_running_vcpu plus a new 'handling_intr_from_guest'
variable in kvm_arch_vcpu instead of the semi-redundant current_vcpu.
kvm_before/after_interrupt() must be called while the vCPU is loaded,
(which protects against preemption), thus kvm_running_vcpu is guaranteed
to be non-NULL when handling_intr_from_guest is non-zero.

Switching to kvm_get_running_vcpu() will allows moving KVM's perf
callbacks to generic code, and the new flag will be used in a future
patch to more precisely identify the "NMI from guest" case.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20211111020738.2512932-11-seanjc@google.com


# 40e5f908 24-Nov-2021 Sean Christopherson <seanjc@google.com>

KVM: nVMX: Abide to KVM_REQ_TLB_FLUSH_GUEST request on nested vmentry/vmexit

Like KVM_REQ_TLB_FLUSH_CURRENT, the GUEST variant needs to be serviced at
nested transitions, as KVM doesn't track requests for L1 vs L2. E.g. if
there's a pending flush when a nested VM-Exit occurs, then the flush was
requested in the context of L2 and needs to be handled before switching
to L1, otherwise the flush for L2 would effectiely be lost.

Opportunistically add a helper to handle CURRENT and GUEST as a pair, the
logic for when they need to be serviced is identical as both requests are
tied to L1 vs. L2, the only difference is the scope of the flush.

Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 07ffaf343e34 ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b5aead00 23-May-2021 Tom Lendacky <thomas.lendacky@amd.com>

KVM: x86: Assume a 64-bit hypercall for guests with protected state

When processing a hypercall for a guest with protected state, currently
SEV-ES guests, the guest CS segment register can't be checked to
determine if the guest is in 64-bit mode. For an SEV-ES guest, it is
expected that communication between the guest and the hypervisor is
performed to shared memory using the GHCB. In order to use the GHCB, the
guest must have been in long mode, otherwise writes by the guest to the
GHCB would be encrypted and not be able to be comprehended by the
hypervisor.

Create a new helper function, is_64_bit_hypercall(), that assumes the
guest is in 64-bit mode when the guest has protected state, and returns
true, otherwise invoking is_64_bit_mode() to determine the mode. Update
the hypercall related routines to use is_64_bit_hypercall() instead of
is_64_bit_mode().

Add a WARN_ON_ONCE() to is_64_bit_mode() to catch occurences of calls to
this helper function for a guest running with protected state.

Fixes: f1c6366e3043 ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <e0b20c770c9d0d1403f23d83e785385104211f74.1621878537.git.thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# dfd3c713 21-Oct-2021 Jim Mattson <jmattson@google.com>

kvm: x86: Remove stale declaration of kvm_no_apic_vcpu

This variable was renamed to kvm_has_noapic_vcpu in commit
6e4e3b4df4e3 ("KVM: Stop using deprecated jump label APIs").

Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20211021185449.3471763-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 65297341 09-Aug-2021 Uros Bizjak <ubizjak@gmail.com>

KVM: x86: Move declaration of kvm_spurious_fault() to x86.h

Move the declaration of kvm_spurious_fault() to KVM's "private" x86.h,
it should never be called by anything other than low level KVM code.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
[sean: rebased to a series without __ex()/__kvm_handle_fault_on_reboot()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210809173955.1710866-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 87e99d7d 22-Jun-2021 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helper

Get LA57 from the role_regs, which are initialized from the vCPU even
though TDP is enabled, instead of pulling the value directly from the
vCPU when computing the guest's root_level for TDP MMUs. Note, the check
is inside an is_long_mode() statement, so that requirement is not lost.

Use role_regs even though the MMU's role is available and arguably
"better". A future commit will consolidate the guest root level logic,
and it needs access to EFER.LMA, which is not tracked in the role (it
can't be toggled on VM-Exit, unlike LA57).

Drop is_la57_mode() as there are no remaining users, and to discourage
pulling MMU state from the vCPU (in the future).

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-41-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# bc908e09 04-May-2021 Sean Christopherson <seanjc@google.com>

KVM: x86: Consolidate guest enter/exit logic to common helpers

Move the enter/exit logic in {svm,vmx}_vcpu_enter_exit() to common
helpers. Opportunistically update the somewhat stale comment about the
updates needing to occur immediately after VM-Exit.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210505002735.1684165-9-seanjc@google.com


# 27b4a9c4 21-Apr-2021 Sean Christopherson <seanjc@google.com>

KVM: x86: Rename GPR accessors to make mode-aware variants the defaults

Append raw to the direct variants of kvm_register_read/write(), and
drop the "l" from the mode-aware variants. I.e. make the mode-aware
variants the default, and make the direct variants scary sounding so as
to discourage use. Accessing the full 64-bit values irrespective of
mode is rarely the desired behavior.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422022128.3464144-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# cb6a32c2 02-Mar-2021 Sean Christopherson <seanjc@google.com>

KVM: x86: Handle triple fault in L2 without killing L1

Synthesize a nested VM-Exit if L2 triggers an emulated triple fault
instead of exiting to userspace, which likely will kill L1. Any flow
that does KVM_REQ_TRIPLE_FAULT is suspect, but the most common scenario
for L2 killing L1 is if L0 (KVM) intercepts a contributory exception that
is _not_intercepted by L1. E.g. if KVM is intercepting #GPs for the
VMware backdoor, a #GP that occurs in L2 while vectoring an injected #DF
will cause KVM to emulate triple fault.

Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210302174515.2812275-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 648fc8ae 03-Feb-2021 Sean Christopherson <seanjc@google.com>

KVM: x86: Move nVMX's consistency check macro to common code

Move KVM's CC() macro to x86.h so that it can be reused by nSVM.
Debugging VM-Enter is as painful on SVM as it is on VMX.

Rename the more visible macro to KVM_NESTED_VMENTER_CONSISTENCY_CHECK
to avoid any collisions with the uber-concise "CC".

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# ecaf088f 26-Mar-2021 Dongli Zhang <dongli.zhang@oracle.com>

KVM: x86: remove unused declaration of kvm_write_tsc()

kvm_write_tsc() was renamed and made static since commit 0c899c25d754
("KVM: x86: do not attempt TSC synchronization on guest writes"). Remove
its unused declaration.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Message-Id: <20210326070334.12310-1-dongli.zhang@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 629b5348 28-Jun-2018 Joao Martins <joao.m.martins@oracle.com>

KVM: x86/xen: update wallclock region

Wallclock on Xen is written in the shared_info page.

To that purpose, export kvm_write_wall_clock() and pass on the GPA of
its location to populate the shared_info wall clock data.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>


# d89d04ab 02-Feb-2021 Paolo Bonzini <pbonzini@redhat.com>

KVM: move EXIT_FASTPATH_REENTER_GUEST to common code

Now that KVM is using static calls, calling vmx_vcpu_run and
vmx_sync_pir_to_irr does not incur anymore the cost of a
retpoline.

Therefore there is no need anymore to handle EXIT_FASTPATH_REENTER_GUEST
in vendor code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b3646477 14-Jan-2021 Jason Baron <jbaron@akamai.com>

KVM: x86: use static calls to reduce kvm_x86_ops overhead

Convert kvm_x86_ops to use static calls. Note that all kvm_x86_ops are
covered here except for 'pmu_ops and 'nested ops'.

Here are some numbers running cpuid in a loop of 1 million calls averaged
over 5 runs, measured in the vm (lower is better).

Intel Xeon 3000MHz:

|default |mitigations=off
-------------------------------------
vanilla |.671s |.486s
static call|.573s(-15%)|.458s(-6%)

AMD EPYC 2500MHz:

|default |mitigations=off
-------------------------------------
vanilla |.710s |.609s
static call|.664s(-6%) |.609s(0%)

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Message-Id: <e057bf1b8a7ad15652df6eeba3f907ae758d3399.1610680941.git.jbaron@akamai.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 4aa2691d 26-Jan-2021 Wei Huang <wei.huang2@amd.com>

KVM: x86: Factor out x86 instruction emulation with decoding

Move the instruction decode part out of x86_emulate_instruction() for it
to be used in other places. Also kvm_clear_exception_queue() is moved
inside the if-statement as it doesn't apply when KVM are coming back from
userspace.

Co-developed-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Message-Id: <20210126081831.570253-2-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# d855066f 07-Jan-2021 Like Xu <like.xu@linux.intel.com>

KVM: VMX: read/write MSR_IA32_DEBUGCTLMSR from GUEST_IA32_DEBUGCTL

SVM already has specific handlers of MSR_IA32_DEBUGCTLMSR in the
svm_get/set_msr, so the x86 common part can be safely moved to VMX.
This allows KVM to store the bits it supports in GUEST_IA32_DEBUGCTL.

Add vmx_supported_debugctl() to refactor the throwing logic of #GP.

Signed-off-by: Like Xu <like.xu@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Message-Id: <20210108013704.134985-2-like.xu@linux.intel.com>
[Merge parts of Chenyi Qiang's "KVM: X86: Expose bus lock debug exception
to guest". - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 4683d758 01-Feb-2021 Vitaly Kuznetsov <vkuznets@redhat.com>

KVM: x86: Supplement __cr4_reserved_bits() with X86_FEATURE_PCID check

Commit 7a873e455567 ("KVM: selftests: Verify supported CR4 bits can be set
before KVM_SET_CPUID2") reveals that KVM allows to set X86_CR4_PCIDE even
when PCID support is missing:

==== Test Assertion Failure ====
x86_64/set_sregs_test.c:41: rc
pid=6956 tid=6956 - Invalid argument
1 0x000000000040177d: test_cr4_feature_bit at set_sregs_test.c:41
2 0x00000000004014fc: main at set_sregs_test.c:119
3 0x00007f2d9346d041: ?? ??:0
4 0x000000000040164d: _start at ??:?
KVM allowed unsupported CR4 bit (0x20000)

Add X86_FEATURE_PCID feature check to __cr4_reserved_bits() to make
kvm_is_valid_cr4() fail.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210201142843.108190-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 86137773 10-Dec-2020 Tom Lendacky <thomas.lendacky@amd.com>

KVM: SVM: Provide support for SEV-ES vCPU loading

An SEV-ES vCPU requires additional VMCB vCPU load/put requirements. SEV-ES
hardware will restore certain registers on VMEXIT, but not save them on
VMRUN (see Table B-3 and Table B-4 of the AMD64 APM Volume 2), so make the
following changes:

General vCPU load changes:
- During vCPU loading, perform a VMSAVE to the per-CPU SVM save area and
save the current values of XCR0, XSS and PKRU to the per-CPU SVM save
area as these registers will be restored on VMEXIT.

General vCPU put changes:
- Do not attempt to restore registers that SEV-ES hardware has already
restored on VMEXIT.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <019390e9cb5e93cd73014fa5a040c17d42588733.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 7ed9abfe 10-Dec-2020 Tom Lendacky <thomas.lendacky@amd.com>

KVM: SVM: Support string IO operations for an SEV-ES guest

For an SEV-ES guest, string-based port IO is performed to a shared
(un-encrypted) page so that both the hypervisor and guest can read or
write to it and each see the contents.

For string-based port IO operations, invoke SEV-ES specific routines that
can complete the operation using common KVM port IO support.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <9d61daf0ffda496703717218f415cdc8fd487100.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 8f423a80 10-Dec-2020 Tom Lendacky <thomas.lendacky@amd.com>

KVM: SVM: Support MMIO for an SEV-ES guest

For an SEV-ES guest, MMIO is performed to a shared (un-encrypted) page
so that both the hypervisor and guest can read or write to it and each
see the contents.

The GHCB specification provides software-defined VMGEXIT exit codes to
indicate a request for an MMIO read or an MMIO write. Add support to
recognize the MMIO requests and invoke SEV-ES specific routines that
can complete the MMIO operation. These routines use common KVM support
to complete the MMIO operation.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <af8de55127d5bcc3253d9b6084a0144c12307d4d.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 3f1a18b9 29-Oct-2020 Uros Bizjak <ubizjak@gmail.com>

KVM/VMX/SVM: Move kvm_machine_check function to x86.h

Move kvm_machine_check to x86.h to avoid two exact copies
of the same function in kvm.c and svm.c.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20201029135600.122392-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# ee69c92b 06-Oct-2020 Sean Christopherson <seanjc@google.com>

KVM: x86: Return bool instead of int for CR4 and SREGS validity checks

Rework the common CR4 and SREGS checks to return a bool instead of an
int, i.e. true/false instead of 0/-EINVAL, and add "is" to the name to
clarify the polarity of the return value (which is effectively inverted
by this change).

No functional changed intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# cc4cb017 01-Nov-2020 Maxim Levitsky <mlevitsk@redhat.com>

KVM: x86: use positive error values for msr emulation that causes #GP

Recent introduction of the userspace msr filtering added code that uses
negative error codes for cases that result in either #GP delivery to
the guest, or handled by the userspace msr filtering.

This breaks an assumption that a negative error code returned from the
msr emulation code is a semi-fatal error which should be returned
to userspace via KVM_RUN ioctl and usually kill the guest.

Fix this by reusing the already existing KVM_MSR_RET_INVALID error code,
and by adding a new KVM_MSR_RET_FILTERED error code for the
userspace filtered msrs.

Fixes: 291f35fb2c1d1 ("KVM: x86: report negative values from wrmsr emulation to userspace")
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20201101115523.115780-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 51de8151 25-Sep-2020 Alexander Graf <graf@amazon.com>

KVM: x86: Add infrastructure for MSR filtering

In the following commits we will add pieces of MSR filtering.
To ensure that code compiles even with the feature half-merged, let's add
a few stubs and struct definitions before the real patches start.

Signed-off-by: Alexander Graf <graf@amazon.com>

Message-Id: <20200925143422.21718-4-graf@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 9715092f 11-Sep-2020 Babu Moger <babu.moger@amd.com>

KVM: X86: Move handling of INVPCID types to x86

INVPCID instruction handling is mostly same across both VMX and
SVM. So, move the code to common x86.c.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985255212.11252.10322694343971983487.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 3f3393b3 11-Sep-2020 Babu Moger <babu.moger@amd.com>

KVM: X86: Rename and move the function vmx_handle_memory_failure to x86.c

Handling of kvm_read/write_guest_virt*() errors can be moved to common
code. The same code can be used by both VMX and SVM.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985254493.11252.6603092560732507607.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 68ca7663 10-Sep-2020 Wanpeng Li <wanpengli@tencent.com>

KVM: LAPIC: Narrow down the kick target vCPU

The kick after setting KVM_REQ_PENDING_TIMER is used to handle the timer
fires on a different pCPU which vCPU is running on. This kick costs about
1000 clock cycles and we don't need this when injecting already-expired
timer or when using the VMX preemption timer because
kvm_lapic_expired_hv_timer() is called from the target vCPU.

Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1599731444-3525-6-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 89786147 10-Jul-2020 Mohammed Gamal <mgamal@redhat.com>

KVM: x86: Add helper functions for illegal GPA checking and page fault injection

This patch adds two helper functions that will be used to support virtualizing
MAXPHYADDR in both kvm-intel.ko and kvm.ko.

kvm_fixup_and_inject_pf_error() injects a page fault for a user-specified GVA,
while kvm_mmu_is_illegal_gpa() checks whether a GPA exceeds vCPU address limits.

Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200710154811.418214-2-mgamal@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 841c2be0 08-Jul-2020 Maxim Levitsky <mlevitsk@redhat.com>

kvm: x86: replace kvm_spec_ctrl_test_value with runtime test on the host

To avoid complex and in some cases incorrect logic in
kvm_spec_ctrl_test_value, just try the guest's given value on the host
processor instead, and if it doesn't #GP, allow the guest to set it.

One such case is when host CPU supports STIBP mitigation
but doesn't support IBRS (as is the case with some Zen2 AMD cpus),
and in this case we were giving guest #GP when it tried to use STIBP

The reason why can can do the host test is that IA32_SPEC_CTRL msr is
passed to the guest, after the guest sets it to a non zero value
for the first time (due to performance reasons),
and as as result of this, it is pointless to emulate #GP condition on
this first access, in a different way than what the host CPU does.

This is based on a patch from Sean Christopherson, who suggested this idea.

Fixes: 6441fa6178f5 ("KVM: x86: avoid incorrect writes to host MSR_IA32_SPEC_CTRL")
Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200708115731.180097-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 761e4169 07-Jul-2020 Krish Sadhukhan <krish.sadhukhan@oracle.com>

KVM: nSVM: Check that MBZ bits in CR3 and CR4 are not set on vmrun of nested guests

According to section "Canonicalization and Consistency Checks" in APM vol. 2
the following guest state is illegal:

"Any MBZ bit of CR3 is set."
"Any MBZ bit of CR4 is set."

Suggeted-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Message-Id: <1594168797-29444-3-git-send-email-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 53efe527 08-Jul-2020 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: Make CR4.VMXE reserved for the guest

CR4.VMXE is reserved unless the VMX CPUID bit is set. On Intel,
it is also tested by vmx_set_cr4, but AMD relies on kvm_valid_cr4,
so fix it.

Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b899c132 07-Jul-2020 Krish Sadhukhan <krish.sadhukhan@oracle.com>

KVM: x86: Create mask for guest CR4 reserved bits in kvm_update_cpuid()

Instead of creating the mask for guest CR4 reserved bits in kvm_valid_cr4(),
do it in kvm_update_cpuid() so that it can be reused instead of creating it
each time kvm_valid_cr4() is called.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Message-Id: <1594168797-29444-2-git-send-email-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# f5f6145e 22-May-2020 Krish Sadhukhan <krish.sadhukhan@oracle.com>

KVM: x86: Move the check for upper 32 reserved bits of DR6 to separate function

Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Message-Id: <20200522221954.32131-2-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 6abe9c13 22-Jun-2020 Peter Xu <peterx@redhat.com>

KVM: X86: Move ignore_msrs handling upper the stack

MSR accesses can be one of:

(1) KVM internal access,
(2) userspace access (e.g., via KVM_SET_MSRS ioctl),
(3) guest access.

The ignore_msrs was previously handled by kvm_get_msr_common() and
kvm_set_msr_common(), which is the bottom of the msr access stack. It's
working in most cases, however it could dump unwanted warning messages to dmesg
even if kvm get/set the msrs internally when calling __kvm_set_msr() or
__kvm_get_msr() (e.g. kvm_cpuid()). Ideally we only want to trap cases (2)
or (3), but not (1) above.

To achieve this, move the ignore_msrs handling upper until the callers of
__kvm_get_msr() and __kvm_set_msr(). To identify the "msr missing" event, a
new return value (KVM_MSR_RET_INVALID==2) is used for that.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20200622220442.21998-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 404d5d7b 28-Apr-2020 Wanpeng Li <wanpengli@tencent.com>

KVM: X86: Introduce more exit_fastpath_completion enum values

Adds a fastpath_t typedef since enum lines are a bit long, and replace
EXIT_FASTPATH_SKIP_EMUL_INS with two new exit_fastpath_completion enum values.

- EXIT_FASTPATH_EXIT_HANDLED kvm will still go through it's full run loop,
but it would skip invoking the exit handler.

- EXIT_FASTPATH_REENTER_GUEST complete fastpath, guest can be re-entered
without invoking the exit handler or going
back to vcpu_run

Tested-by: Haiwei Li <lihaiwei@tencent.com>
Cc: Haiwei Li <lihaiwei@tencent.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1588055009-12677-4-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 5a9f5443 28-Apr-2020 Wanpeng Li <wanpengli@tencent.com>

KVM: X86: Introduce kvm_vcpu_exit_request() helper

Introduce kvm_vcpu_exit_request() helper, we need to check some conditions
before enter guest again immediately, we skip invoking the exit handler and
go through full run loop if complete fastpath but there is stuff preventing
we enter guest again immediately.

Tested-by: Haiwei Li <lihaiwei@tencent.com>
Cc: Haiwei Li <lihaiwei@tencent.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1588055009-12677-5-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# eeeb4f67 20-Mar-2020 Sean Christopherson <seanjc@google.com>

KVM: x86: Introduce KVM_REQ_TLB_FLUSH_CURRENT to flush current ASID

Add KVM_REQ_TLB_FLUSH_CURRENT to allow optimized TLB flushing of VMX's
EPTP/VPID contexts[*] from the KVM MMU and/or in a deferred manner, e.g.
to flush L2's context during nested VM-Enter.

Convert KVM_REQ_TLB_FLUSH to KVM_REQ_TLB_FLUSH_CURRENT in flows where
the flush is directly associated with vCPU-scoped instruction emulation,
i.e. MOV CR3 and INVPCID.

Add a comment in vmx_vcpu_load_vmcs() above its KVM_REQ_TLB_FLUSH to
make it clear that it deliberately requests a flush of all contexts.

Service any pending flush request on nested VM-Exit as it's possible a
nested VM-Exit could occur after requesting a flush for L2. Add the
same logic for nested VM-Enter even though it's _extremely_ unlikely
for flush to be pending on nested VM-Enter, but theoretically possible
(in the future) due to RSM (SMM) emulation.

[*] Intel also has an Address Space Identifier (ASID) concept, e.g.
EPTP+VPID+PCID == ASID, it's just not documented in the SDM because
the rules of invalidation are different based on which piece of the
ASID is being changed, i.e. whether the EPTP, VPID, or PCID context
must be invalidated.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-25-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# afaf0b2f 21-Mar-2020 Sean Christopherson <seanjc@google.com>

KVM: x86: Copy kvm_x86_ops by value to eliminate layer of indirection

Replace the kvm_x86_ops pointer in common x86 with an instance of the
struct to save one pointer dereference when invoking functions. Copy the
struct by value to set the ops during kvm_init().

Arbitrarily use kvm_x86_ops.hardware_enable to track whether or not the
ops have been initialized, i.e. a vendor KVM module has been loaded.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200321202603.19355-7-sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 408e9a31 05-Mar-2020 Paolo Bonzini <pbonzini@redhat.com>

KVM: CPUID: add support for supervisor states

Current CPUID 0xd enumeration code does not support supervisor
states, because KVM only supports setting IA32_XSS to zero.
Change it instead to use a new variable supported_xss, to be
set from the hardware_setup callback which is in charge of CPU
capabilities.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 615a4ae1 02-Mar-2020 Sean Christopherson <seanjc@google.com>

KVM: x86: Make kvm_mpx_supported() an inline function

Expose kvm_mpx_supported() as a static inline so that it can be inlined
in kvm_intel.ko.

No functional change intended.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# cfc48181 02-Mar-2020 Sean Christopherson <seanjc@google.com>

KVM: x86: Calculate the supported xcr0 mask at load time

Add a new global variable, supported_xcr0, to track which xcr0 bits can
be exposed to the guest instead of calculating the mask on every call.
The supported bits are constant for a given instance of KVM.

This paves the way toward eliminating the ->mpx_supported() call in
kvm_mpx_supported(), e.g. eliminates multiple retpolines in VMX's nested
VM-Enter path, and eventually toward eliminating ->mpx_supported()
altogether.

No functional change intended.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 2f728d66 18-Feb-2020 Sean Christopherson <seanjc@google.com>

KVM: x86: Move kvm_emulate.h into KVM's private directory

Now that the emulation context is dynamically allocated and not embedded
in struct kvm_vcpu, move its header, kvm_emulate.h, out of the public
asm directory and into KVM's private x86 directory.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# f0ed4760 18-Feb-2020 Sean Christopherson <seanjc@google.com>

KVM: x86: Move emulation-only helpers to emulate.c

Move ctxt_virt_addr_bits() and emul_is_noncanonical_address() from x86.h
to emulate.c. This eliminates all references to struct x86_emulate_ctxt
from x86.h, and sets the stage for a future patch to stop including
kvm_emulate.h in asm/kvm_host.h.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 9b5e8532 24-Jan-2020 Sean Christopherson <seanjc@google.com>

KVM: x86: Take a u64 when checking for a valid dr7 value

Take a u64 instead of an unsigned long in kvm_dr7_valid() to fix a build
warning on i386 due to right-shifting a 32-bit value by 32 when checking
for bits being set in dr7[63:32].

Alternatively, the warning could be resolved by rewriting the check to
use an i386-friendly method, but taking a u64 fixes another oddity on
32-bit KVM. Beause KVM implements natural width VMCS fields as u64s to
avoid layout issues between 32-bit and 64-bit, a devious guest can stuff
vmcs12->guest_dr7 with a 64-bit value even when both the guest and host
are 32-bit kernels. KVM eventually drops vmcs12->guest_dr7[63:32] when
propagating vmcs12->guest_dr7 to vmcs02, but ideally KVM would not rely
on that behavior for correctness.

Cc: Jim Mattson <jmattson@google.com>
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Fixes: ecb697d10f70 ("KVM: nVMX: Check GUEST_DR7 on vmentry of nested guests")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b91991bf 15-Jan-2020 Krish Sadhukhan <krish.sadhukhan@oracle.com>

KVM: nVMX: Check GUEST_DR7 on vmentry of nested guests

According to section "Checks on Guest Control Registers, Debug Registers, and
and MSRs" in Intel SDM vol 3C, the following checks are performed on vmentry
of nested guests:

If the "load debug controls" VM-entry control is 1, bits 63:32 in the DR7
field must be 0.

In KVM, GUEST_DR7 is set prior to the vmcs02 VM-entry by kvm_set_dr() and the
latter synthesizes a #GP if any bit in the high dword in the former is set.
Hence this field needs to be checked in software.

Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# de761ea7 15-Jan-2020 Sean Christopherson <seanjc@google.com>

KVM: x86: Perform non-canonical checks in 32-bit KVM

Remove the CONFIG_X86_64 condition from the low level non-canonical
helpers to effectively enable non-canonical checks on 32-bit KVM.
Non-canonical checks are performed by hardware if the CPU *supports*
64-bit mode, whether or not the CPU is actually in 64-bit mode is
irrelevant.

For the most part, skipping non-canonical checks on 32-bit KVM is ok-ish
because 32-bit KVM always (hopefully) drops bits 63:32 of whatever value
it's checking before propagating it to hardware, and architecturally,
the expected behavior for the guest is a bit of a grey area since the
vCPU itself doesn't support 64-bit mode. I.e. a 32-bit KVM guest can
observe the missed checks in several paths, e.g. INVVPID and VM-Enter,
but it's debatable whether or not the missed checks constitute a bug
because technically the vCPU doesn't support 64-bit mode.

The primary motivation for enabling the non-canonical checks is defense
in depth. As mentioned above, a guest can trigger a missed check via
INVVPID or VM-Enter. INVVPID is straightforward as it takes a 64-bit
virtual address as part of its 128-bit INVVPID descriptor and fails if
the address is non-canonical, even if INVVPID is executed in 32-bit PM.
Nested VM-Enter is a bit more convoluted as it requires the guest to
write natural width VMCS fields via memory accesses and then VMPTRLD the
VMCS, but it's still possible. In both cases, KVM is saved from a true
bug only because its flows that propagate values to hardware (correctly)
take "unsigned long" parameters and so drop bits 63:32 of the bad value.

Explicitly performing the non-canonical checks makes it less likely that
a bad value will be propagated to hardware, e.g. in the INVVPID case,
if __invvpid() didn't implicitly drop bits 63:32 then KVM would BUG() on
the resulting unexpected INVVPID failure due to hardware rejecting the
non-canonical address.

The only downside to enabling the non-canonical checks is that it adds a
relatively small amount of overhead, but the affected flows are not hot
paths, i.e. the overhead is negligible.

Note, KVM technically could gate the non-canonical checks on 32-bit KVM
with static_cpu_has(X86_FEATURE_LM), but on bare metal that's an even
bigger waste of code for everyone except the 0.00000000000001% of the
population running on Yonah, and nested 32-bit on 64-bit already fudges
things with respect to 64-bit CPU behavior.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
[Also do so in nested_vmx_check_host_state as reported by Krish. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 6441fa61 20-Jan-2020 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: avoid incorrect writes to host MSR_IA32_SPEC_CTRL

If the guest is configured to have SPEC_CTRL but the host does not
(which is a nonsensical configuration but these are not explicitly
forbidden) then a host-initiated MSR write can write vmx->spec_ctrl
(respectively svm->spec_ctrl) and trigger a #GP when KVM tries to
restore the host value of the MSR. Add a more comprehensive check
for valid bits of SPEC_CTRL, covering host CPUID flags and,
since we are at it and it is more correct that way, guest CPUID
flags too.

For AMD, remove the unnecessary is_guest_mode check around setting
the MSR interception bitmap, so that the code looks the same as
for Intel.

Cc: Jim Mattson <jmattson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# a0a2260c 17-Dec-2019 Sean Christopherson <seanjc@google.com>

KVM: x86: Move bit() helper to cpuid.h

Move bit() to cpuid.h in preparation for incorporating the reverse_cpuid
array in bit() build-time assertions. Opportunistically use the BIT()
macro instead of open-coding the shift.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 1e9e2622 20-Nov-2019 Wanpeng Li <wanpengli@tencent.com>

KVM: VMX: FIXED+PHYSICAL mode single target IPI fastpath

ICR and TSCDEADLINE MSRs write cause the main MSRs write vmexits in our
product observation, multicast IPIs are not as common as unicast IPI like
RESCHEDULE_VECTOR and CALL_FUNCTION_SINGLE_VECTOR etc.

This patch introduce a mechanism to handle certain performance-critical
WRMSRs in a very early stage of KVM VMExit handler.

This mechanism is specifically used for accelerating writes to x2APIC ICR
that attempt to send a virtual IPI with physical destination-mode, fixed
delivery-mode and single target. Which was found as one of the main causes
of VMExits for Linux workloads.

The reason this mechanism significantly reduce the latency of such virtual
IPIs is by sending the physical IPI to the target vCPU in a very early stage
of KVM VMExit handler, before host interrupts are enabled and before expensive
operations such as reacquiring KVM’s SRCU lock.
Latency is reduced even more when KVM is able to use APICv posted-interrupt
mechanism (which allows to deliver the virtual IPI directly to target vCPU
without the need to kick it to host).

Testing on Xeon Skylake server:

The virtual IPI latency from sender send to receiver receive reduces
more than 200+ cpu cycles.

Reviewed-by: Liran Alon <liran.alon@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 736c291c 06-Dec-2019 Sean Christopherson <seanjc@google.com>

KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM

Convert a plethora of parameters and variables in the MMU and page fault
flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.

Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
addresses. When TDP is enabled, the fault address is a guest physical
address and thus can be a 64-bit value, even when both KVM and its guest
are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
64-bit field, not a natural width field.

Using a gva_t for the fault address means KVM will incorrectly drop the
upper 32-bits of the GPA. Ditto for gva_to_gpa() when it is used to
translate L2 GPAs to L1 GPAs.

Opportunistically rename variables and parameters to better reflect the
dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
"addr" instead of "vaddr" when the address may be either a GVA or an L2
GPA. Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
a confusing "gpa_t gva" declaration; this also sets the stage for a
future patch to combing nonpaging_page_fault() and tdp_page_fault() with
minimal churn.

Sprinkle in a few comments to document flows where an address is known
to be a GVA and thus can be safely truncated to a 32-bit value. Add
WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
document such cases and detect bugs.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 27cbe7d6 11-Nov-2019 Liran Alon <liran.alon@oracle.com>

KVM: x86: Prevent set vCPU into INIT/SIPI_RECEIVED state when INIT are latched

Commit 4b9852f4f389 ("KVM: x86: Fix INIT signal handling in various CPU states")
fixed KVM to also latch pending LAPIC INIT event when vCPU is in VMX
operation.

However, current API of KVM_SET_MP_STATE allows userspace to put vCPU
into KVM_MP_STATE_SIPI_RECEIVED or KVM_MP_STATE_INIT_RECEIVED even when
vCPU is in VMX operation.

Fix this by introducing a util method to check if vCPU state latch INIT
signals and use it in KVM_SET_MP_STATE handler.

Fixes: 4b9852f4f389 ("KVM: x86: Fix INIT signal handling in various CPU states")
Reported-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 139a12cf 21-Oct-2019 Aaron Lewis <aaronlewis@google.com>

KVM: x86: Move IA32_XSS-swapping on VM-entry/VM-exit to common x86 code

Hoist the vendor-specific code related to loading the hardware IA32_XSS
MSR with guest/host values on VM-entry/VM-exit to common x86 code.

Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Change-Id: Ic6e3430833955b98eb9b79ae6715cf2a3fdd6d82
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 489cbcf0 27-Sep-2019 Sean Christopherson <seanjc@google.com>

KVM: x86: Add WARNs to detect out-of-bounds register indices

Add WARN_ON_ONCE() checks in kvm_register_{read,write}() to detect reg
values that would cause KVM to overflow vcpu->arch.regs. Change the reg
param to an 'int' to make it clear that the reg index is unverified.

Regarding the overhead of WARN_ON_ONCE(), now that all fixed GPR reads
and writes use dedicated accessors, e.g. kvm_rax_read(), the overhead
is limited to flows where the reg index is generated at runtime. And
there is at least one historical bug where KVM has generated an out-of-
bounds access to arch.regs (see commit b68f3cc7d9789, "KVM: x86: Always
use 32-bit SMRAM save state for 32-bit kernels").

Adding the WARN_ON_ONCE() protection paves the way for additional
cleanup related to kvm_reg and kvm_reg_ex.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 9497e1f2 27-Aug-2019 Sean Christopherson <seanjc@google.com>

KVM: x86: Move triple fault request into RM int injection

Request triple fault in kvm_inject_realmode_interrupt() instead of
returning EMULATE_FAIL and deferring to the caller. All existing
callers request triple fault and it's highly unlikely Real Mode is
going to acquire new features. While this consolidates a small amount
of code, the real goal is to remove the last reference to EMULATE_FAIL.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 871bd034 01-Aug-2019 Sean Christopherson <seanjc@google.com>

KVM: x86: Rename access permissions cache member in struct kvm_vcpu_arch

Rename "access" to "mmio_access" to match the other MMIO cache members
and to make it more obvious that it's tracking the access permissions
for the MMIO cache.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 0c5f81da 05-Jul-2019 Wanpeng Li <wanpengli@tencent.com>

KVM: LAPIC: Inject timer interrupt via posted interrupt

Dedicated instances are currently disturbed by unnecessary jitter due
to the emulated lapic timers firing on the same pCPUs where the
vCPUs reside. There is no hardware virtual timer on Intel for guest
like ARM, so both programming timer in guest and the emulated timer fires
incur vmexits. This patch tries to avoid vmexit when the emulated timer
fires, at least in dedicated instance scenario when nohz_full is enabled.

In that case, the emulated timers can be offload to the nearest busy
housekeeping cpus since APICv has been found for several years in server
processors. The guest timer interrupt can then be injected via posted interrupts,
which are delivered by the housekeeping cpu once the emulated timer fires.

The host should tuned so that vCPUs are placed on isolated physical
processors, and with several pCPUs surplus for busy housekeeping.
If disabled mwait/hlt/pause vmexits keep the vCPUs in non-root mode,
~3% redis performance benefit can be observed on Skylake server, and the
number of external interrupt vmexits drops substantially. Without patch

VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
EXTERNAL_INTERRUPT 42916 49.43% 39.30% 0.47us 106.09us 0.71us ( +- 1.09% )

While with patch:

VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
EXTERNAL_INTERRUPT 6871 9.29% 2.96% 0.44us 57.88us 0.72us ( +- 4.02% )

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# bf03d4f9 06-Jun-2019 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: introduce is_pae_paging

Checking for 32-bit PAE is quite common around code that fiddles with
the PDPTRs. Add a function to compress all checks into a single
invocation.

Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b5170063 21-May-2019 Wanpeng Li <wanpengli@tencent.com>

KVM: X86: Provide a capability to disable cstate msr read intercepts

Allow guest reads CORE cstate when exposing host CPU power management capabilities
to the guest. PKG cstate is restricted to avoid a guest to get the whole package
information in multi-tenant scenario.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 39497d76 17-Apr-2019 Sean Christopherson <seanjc@google.com>

KVM: lapic: Track lapic timer advance per vCPU

Automatically adjusting the globally-shared timer advancement could
corrupt the timer, e.g. if multiple vCPUs are concurrently adjusting
the advancement value. That could be partially fixed by using a local
variable for the arithmetic, but it would still be susceptible to a
race when setting timer_advance_adjust_done.

And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the
correct calibration for a given vCPU may not apply to all vCPUs.

Furthermore, lapic_timer_advance_ns is marked __read_mostly, which is
effectively violated when finding a stable advancement takes an extended
amount of timer.

Opportunistically change the definition of lapic_timer_advance_ns to
a u32 so that it matches the style of struct kvm_timer. Explicitly
pass the param to kvm_create_lapic() so that it doesn't have to be
exposed to lapic.c, thus reducing the probability of unintentionally
using the global value instead of the per-vCPU value.

Cc: Liran Alon <liran.alon@oracle.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Cc: stable@vger.kernel.org
Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 674ea351 10-Apr-2019 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: optimize check for valid PAT value

This check will soon be done on every nested vmentry and vmexit,
"parallelize" it using bitwise operations.

Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 1811d979 12-Apr-2019 WANG Chao <chao.wang@ucloud.cn>

x86/kvm: move kvm_load/put_guest_xcr0 into atomic context

guest xcr0 could leak into host when MCE happens in guest mode. Because
do_machine_check() could schedule out at a few places.

For example:

kvm_load_guest_xcr0
...
kvm_x86_ops->run(vcpu) {
vmx_vcpu_run
vmx_complete_atomic_exit
kvm_machine_check
do_machine_check
do_memory_failure
memory_failure
lock_page

In this case, host_xcr0 is 0x2ff, guest vcpu xcr0 is 0xff. After schedule
out, host cpu has guest xcr0 loaded (0xff).

In __switch_to {
switch_fpu_finish
copy_kernel_to_fpregs
XRSTORS

If any bit i in XSTATE_BV[i] == 1 and xcr0[i] == 0, XRSTORS will
generate #GP (In this case, bit 9). Then ex_handler_fprestore kicks in
and tries to reinitialize fpu by restoring init fpu state. Same story as
last #GP, except we get DOUBLE FAULT this time.

Cc: stable@vger.kernel.org
Signed-off-by: WANG Chao <chao.wang@ucloud.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 361209e0 05-Feb-2019 Sean Christopherson <seanjc@google.com>

KVM: Explicitly define the "memslot update in-progress" bit

KVM uses bit 0 of the memslots generation as an "update in-progress"
flag, which is used by x86 to prevent caching MMIO access while the
memslots are changing. Although the intended behavior is flag-like,
e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
caching data from in-flux memslots, the implementation oftentimes treats
the bit as part of the generation number itself, e.g. incrementing the
generation increments twice, once to set the flag and once to clear it.

Prior to commit 4bd518f1598d ("KVM: use separate generations for
each address space"), incorporating the "update in-progress" bit into
the generation number largely made sense, e.g. "real" generations are
even, "bogus" generations are odd, most code doesn't need to be aware of
the bit, etc...

Now that unique memslots generation numbers are assigned to each address
space, stealthing the in-progress status into the generation number
results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
over bit 0 when initializing the memslots generation without any hint as
to why.

Explicitly define the flag and convert as much code as possible (which
isn't much) to actually treat it like a flag. This paves the way for
eventually using a different bit for "update in-progress" so that it can
be a flag in truth instead of a awkward extension to the generation
number.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# ddfd1730 05-Feb-2019 Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux

When installing new memslots, KVM sets bit 0 of the generation number to
indicate that an update is in-progress. Until the update is complete,
there are no guarantees as to whether a vCPU will see the old or the new
memslots. Explicity prevent caching MMIO accesses so as to avoid using
an access cached from the old memslots after the new memslots have been
installed.

Note that it is unclear whether or not disabling caching during the
update window is strictly necessary as there is no definitive
documentation as to what ordering guarantees KVM provides with respect
to updating memslots. That being said, the MMIO spte code does not
allow reusing sptes created while an update is in-progress, and the
associated documentation explicitly states:

We do not want to use an MMIO sptes created with an odd generation
number, ... If KVM is unlucky and creates an MMIO spte while the
low bit is 1, the next access to the spte will always be a cache miss.

At the very least, disabling the per-vCPU MMIO cache during updates will
make its behavior consistent with the MMIO spte behavior and
documentation.

Fixes: 56f17dd3fbc4 ("kvm: x86: fix stale mmio cache bug")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# da998b46 16-Oct-2018 Jim Mattson <jmattson@google.com>

kvm: x86: Defer setting of CR2 until #PF delivery

When exception payloads are enabled by userspace (which is not yet
possible) and a #PF is raised in L2, defer the setting of CR2 until
the #PF is delivered. This allows the L1 hypervisor to intercept the
fault before CR2 is modified.

For backwards compatibility, when exception payloads are not enabled
by userspace, kvm_multiple_exception modifies CR2 when the #PF
exception is raised.

Reported-by: Jim Mattson <jmattson@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# c60658d1 23-Aug-2018 Sean Christopherson <seanjc@google.com>

KVM: x86: Unexport x86_emulate_instruction()

Allowing x86_emulate_instruction() to be called directly has led to
subtle bugs being introduced, e.g. not setting EMULTYPE_NO_REEXECUTE
in the emulation type. While most of the blame lies on re-execute
being opt-out, exporting x86_emulate_instruction() also exposes its
cr2 parameter, which may have contributed to commit d391f1207067
("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO
when running nested") using x86_emulate_instruction() instead of
emulate_instruction() because "hey, I have a cr2!", which in turn
introduced its EMULTYPE_NO_REEXECUTE bug.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>


# 0447378a 20-Jun-2018 Marc Orr <marcorr@google.com>

kvm: vmx: Nested VM-entry prereqs for event inj.

This patch extends the checks done prior to a nested VM entry.
Specifically, it extends the check_vmentry_prereqs function with checks
for fields relevant to the VM-entry event injection information, as
described in the Intel SDM, volume 3.

This patch is motivated by a syzkaller bug, where a bad VM-entry
interruption information field is generated in the VMCS02, which causes
the nested VM launch to fail. Then, KVM fails to resume L1.

While KVM should be improved to correctly resume L1 execution after a
failed nested launch, this change is justified because the existing code
to resume L1 is flaky/ad-hoc and the test coverage for resuming L1 is
sparse.

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Marc Orr <marcorr@google.com>
[Removed comment whose parts were describing previous revisions and the
rest was obvious from function/variable naming. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>


# ce14e868a 06-Jun-2018 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_system

Int the next patch the emulator's .read_std and .write_std callbacks will
grow another argument, which is not needed in kvm_read_guest_virt and
kvm_write_guest_virt_system's callers. Since we have to make separate
functions, let's give the currently existing names a nicer interface, too.

Fixes: 129a72a0d3c8 ("KVM: x86: Introduce segmented_write_std", 2017-01-12)
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 5e62493f 16-Apr-2018 KarimAllah Ahmed <karahmed@amazon.de>

x86/headers/UAPI: Move DISABLE_EXITS KVM capability bits to the UAPI

Move DISABLE_EXITS KVM capability bits to the UAPI just like the rest of
capabilities.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>


# 082d06ed 03-Apr-2018 Wanpeng Li <wanpengli@tencent.com>

KVM: X86: Introduce handle_ud()

Introduce handle_ud() to handle invalid opcode, this function will be
used by later patches.

Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim KrÄmář <rkrcmar@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 04140b41 22-Mar-2018 Liran Alon <liran.alon@oracle.com>

KVM: x86: Rename interrupt.pending to interrupt.injected

For exceptions & NMIs events, KVM code use the following
coding convention:
*) "pending" represents an event that should be injected to guest at
some point but it's side-effects have not yet occurred.
*) "injected" represents an event that it's side-effects have already
occurred.

However, interrupts don't conform to this coding convention.
All current code flows mark interrupt.pending when it's side-effects
have already taken place (For example, bit moved from LAPIC IRR to
ISR). Therefore, it makes sense to just rename
interrupt.pending to interrupt.injected.

This change follows logic of previous commit 664f8e26b00c ("KVM: X86:
Fix loss of exception which has not yet been injected") which changed
exception to follow this coding convention as well.

It is important to note that in case !lapic_in_kernel(vcpu),
interrupt.pending usage was and still incorrect.
In this case, interrrupt.pending can only be set using one of the
following ioctls: KVM_INTERRUPT, KVM_SET_VCPU_EVENTS and
KVM_SET_SREGS. Looking at how QEMU uses these ioctls, one can see that
QEMU uses them either to re-set an "interrupt.pending" state it has
received from KVM (via KVM_GET_VCPU_EVENTS interrupt.pending or
via KVM_GET_SREGS interrupt_bitmap) or by dispatching a new interrupt
from QEMU's emulated LAPIC which reset bit in IRR and set bit in ISR
before sending ioctl to KVM. So it seems that indeed "interrupt.pending"
in this case is also suppose to represent "interrupt.injected".
However, kvm_cpu_has_interrupt() & kvm_cpu_has_injectable_intr()
is misusing (now named) interrupt.injected in order to return if
there is a pending interrupt.
This leads to nVMX/nSVM not be able to distinguish if it should exit
from L2 to L1 on EXTERNAL_INTERRUPT on pending interrupt or should
re-inject an injected interrupt.
Therefore, add a FIXME at these functions for handling this issue.

This patch introduce no semantics change.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>


# 8566ac8b 16-Mar-2018 Babu Moger <babu.moger@amd.com>

KVM: SVM: Implement pause loop exit logic in SVM

Bring the PLE(pause loop exit) logic to AMD svm driver.

While testing, we found this helping in situations where numerous
pauses are generated. Without these patches we could see continuos
VMEXITS due to pause interceptions. Tested it on AMD EPYC server with
boot parameter idle=poll on a VM with 32 vcpus to simulate extensive
pause behaviour. Here are VMEXITS in 10 seconds interval.

Pauses 810199 504
Total 882184 325415

Signed-off-by: Babu Moger <babu.moger@amd.com>
[Prevented the window from dropping below the initial value. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>


# c8e88717 16-Mar-2018 Babu Moger <babu.moger@amd.com>

KVM: VMX: Bring the common code to header file

This patch brings some of the code from vmx to x86.h header file. Now, we
can share this code between vmx and svm. Modified couple functions to make
it common.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>


# dd60d217 25-Jul-2017 Andi Kleen <ak@linux.intel.com>

KVM: x86: Fix perf timer mode IP reporting

KVM and perf have a special backdoor mechanism to report the IP for interrupts
re-executed after vm exit. This works for the NMIs that perf normally uses.

However when perf is in timer mode it doesn't work because the timer interrupt
doesn't get this special treatment. This is common when KVM is running
nested in another hypervisor which may not implement the PMU, so only
timer mode is available.

Call the functions to set up the backdoor IP also for non NMI interrupts.

I renamed the functions to set up the backdoor IP reporting to be more
appropiate for their new use. The SVM change is only compile tested.

v2: Moved the functions inline.
For the normal interrupt case the before/after functions are now
called from x86.c, not arch specific code.
For the NMI case we still need to call it in the architecture
specific code, because it's already needed in the low level *_run
functions.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
[Removed unnecessary calls from arch handle_external_intr. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>


# b31c114b 12-Mar-2018 Wanpeng Li <wanpengli@tencent.com>

KVM: X86: Provide a capability to disable PAUSE intercepts

Allow to disable pause loop exit/pause filtering on a per VM basis.

If some VMs have dedicated host CPUs, they won't be negatively affected
due to needlessly intercepted PAUSE instructions.

Thanks to Jan H. Schönherr's initial patch.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# caa057a2 12-Mar-2018 Wanpeng Li <wanpengli@tencent.com>

KVM: X86: Provide a capability to disable HLT intercepts

If host CPUs are dedicated to a VM, we can avoid VM exits on HLT.
This patch adds the per-VM capability to disable them.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 4d5422ce 12-Mar-2018 Wanpeng Li <wanpengli@tencent.com>

KVM: X86: Provide a capability to disable MWAIT intercepts

Allowing a guest to execute MWAIT without interception enables a guest
to put a (physical) CPU into a power saving state, where it takes
longer to return from than what may be desired by the host.

Don't give a guest that power over a host by default. (Especially,
since nothing prevents a guest from using MWAIT even when it is not
advertised via CPUID.)

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# c4ae60e4 12-Mar-2018 Liran Alon <liran.alon@oracle.com>

KVM: x86: Add module parameter for supporting VMware backdoor

Support access to VMware backdoor requires KVM to intercept #GP
exceptions from guest which introduce slight performance hit.
Therefore, control this support by module parameter.

Note that module parameter is exported as it should be consumed by
kvm_intel & kvm_amd to determine if they should intercept #GP or not.

This commit doesn't change semantics.
It is done as a preparation for future commits.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 5c7d4f9a 19-Nov-2017 Liran Alon <liran.alon@oracle.com>

KVM: nVMX: Fix bug of injecting L2 exception into L1

kvm_clear_exception_queue() should clear pending exception.
This also includes exceptions which were only marked pending but not
yet injected. This is because exception.pending is used for both L1
and L2 to determine if an exception should be raised to guest.
Note that an exception which is pending but not yet injected will
be raised again once the guest will be resumed.

Consider the following scenario:
1) L0 KVM with ignore_msrs=false.
2) L1 prepare vmcs12 with the following:
a) No intercepts on MSR (MSR_BITMAP exist and is filled with 0).
b) No intercept for #GP.
c) vmx-preemption-timer is configured.
3) L1 enters into L2.
4) L2 reads an unhandled MSR that exists in MSR_BITMAP
(such as 0x1fff).

L2 RDMSR could be handled as described below:
1) L2 exits to L0 on RDMSR and calls handle_rdmsr().
2) handle_rdmsr() calls kvm_inject_gp() which sets
KVM_REQ_EVENT, exception.pending=true and exception.injected=false.
3) vcpu_enter_guest() consumes KVM_REQ_EVENT and calls
inject_pending_event() which calls vmx_check_nested_events()
which sees that exception.pending=true but
nested_vmx_check_exception() returns 0 and therefore does nothing at
this point. However let's assume it later sees vmx-preemption-timer
expired and therefore exits from L2 to L1 by calling
nested_vmx_vmexit().
4) nested_vmx_vmexit() calls prepare_vmcs12()
which calls vmcs12_save_pending_event() but it does nothing as
exception.injected is false. Also prepare_vmcs12() calls
kvm_clear_exception_queue() which does nothing as
exception.injected is already false.
5) We now return from vmx_check_nested_events() with 0 while still
having exception.pending=true!
6) Therefore inject_pending_event() continues
and we inject L2 exception to L1!...

This commit will fix above issue by changing step (4) to
clear exception.pending in kvm_clear_exception_queue().

Fixes: 664f8e26b00c ("KVM: X86: Fix loss of exception which has not yet been injected")
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>


# 431f5d44 29-Nov-2017 Radim Krčmář <rkrcmar@redhat.com>

KVM: x86: simplify kvm_mwait_in_guest()

If Intel/AMD implements MWAIT, we expect that it works well and only
reject known bugs; no reason to do it the other way around for minor
vendors. (Not that they are relevant ATM.)

This allows further simplification of kvm_mwait_in_guest().
And use boot_cpu_has() instead of "cpu_has(&boot_cpu_data," while at it.

Reviewed-by: Alexander Graf <agraf@suse.de>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 346f48fa 29-Nov-2017 Radim Krčmář <rkrcmar@redhat.com>

KVM: x86: drop bogus MWAIT check

The check was added in some iteration while trying to fix a reported OS
X on Core 2 bug, but that bug is elsewhere.

The comment is misleading because the guest can call MWAIT with ECX = 0
even if we enforce CPUID5_ECX_INTERRUPT_BREAK; the call would have the
exactly the same effect as if the host didn't have the feature.

A problem is that a QEMU feature exposes CPUID5_ECX_INTERRUPT_BREAK on
CPUs that do not support it. Removing the check changes behavior on
last Pentium 4 lines (Presler, Dempsey, and Tulsa, which had VMX and
MONITOR while missing INTERRUPT_BREAK) when running a guest OS that uses
MWAIT without checking for its presence (QEMU doesn't expose MONITOR).

The only known OS that ignores the MONITOR flag is old Mac OS X and we
allowed it to bug on Core 2 (MWAIT used to throw #UD and only that OS
noticed), so we can save another 20 lines letting it bug on even older
CPUs. Alternatively, we can return MWAIT exiting by default and let
userspace toggle it.

Reviewed-by: Alexander Graf <agraf@suse.de>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 2a140f3b 29-Nov-2017 Radim Krčmář <rkrcmar@redhat.com>

KVM: x86: prevent MWAIT in guest with buggy MONITOR

The bug prevents MWAIT from waking up after a write to the monitored
cache line.
KVM might emulate a CPU model that shouldn't have the bug, so the guest
would not employ a workaround and possibly miss wakeups.
Better to avoid the situation.

Reviewed-by: Alexander Graf <agraf@suse.de>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 664f8e26 24-Aug-2017 Wanpeng Li <wanpeng.li@hotmail.com>

KVM: X86: Fix loss of exception which has not yet been injected

vmx_complete_interrupts() assumes that the exception is always injected,
so it can be dropped by kvm_clear_exception_queue(). However,
an exception cannot be injected immediately if it is: 1) originally
destined to a nested guest; 2) trapped to cause a vmexit; 3) happening
right after VMLAUNCH/VMRESUME, i.e. when nested_run_pending is true.

This patch applies to exceptions the same algorithm that is used for
NMIs, replacing exception.reinject with "exception.injected" (equivalent
to nmi_injected).

exception.pending now represents an exception that is queued and whose
side effects (e.g., update RFLAGS.RF or DR7) have not been applied yet.
If exception.pending is true, the exception might result in a nested
vmexit instead, too (in which case the side effects must not be applied).

exception.injected instead represents an exception that is going to be
injected into the guest at the next vmentry.

Reported-by: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# fd8cb433 24-Aug-2017 Yu Zhang <yu.c.zhang@linux.intel.com>

KVM: MMU: Expose the LA57 feature to VM.

This patch exposes 5 level page table feature to the VM.
At the same time, the canonical virtual address checking is
extended to support both 48-bits and 57-bits address width.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 855feb67 24-Aug-2017 Yu Zhang <yu.c.zhang@linux.intel.com>

KVM: MMU: Add 5 level EPT & Shadow page table support.

Extends the shadow paging code, so that 5 level shadow page
table can be constructed if VM is running in 5 level paging
mode.

Also extends the ept code, so that 5 level ept table can be
constructed if maxphysaddr of VM exceeds 48 bits. Unlike the
shadow logic, KVM should still use 4 level ept table for a VM
whose physical address width is less than 48 bits, even when
the VM is running in 5 level paging mode.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
[Unconditionally reset the MMU context in kvm_cpuid_update.
Changing MAXPHYADDR invalidates the reserved bit bitmasks.
- Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 9034e6e8 17-Aug-2017 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: fix use of L1 MMIO areas in nested guests

There is currently some confusion between nested and L1 GPAs. The
assignment to "direct" in kvm_mmu_page_fault tries to fix that, but
it is not enough. What this patch does is fence off the MMIO cache
completely when using shadow nested page tables, since we have neither
a GVA nor an L1 GPA to put in the cache. This also allows some
simplifications in kvm_mmu_page_fault and FNAME(page_fault).

The EPT misconfig likewise does not have an L1 GPA to pass to
kvm_io_bus_write, so that must be skipped for guest mode.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Changed comment to say "GPAs" instead of "L1's physical addresses", as
per David's review. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>


# 668fffa3 20-Apr-2017 Michael S. Tsirkin <mst@redhat.com>

kvm: better MWAIT emulation for guests

Guests that are heavy on futexes end up IPI'ing each other a lot. That
can lead to significant slowdowns and latency increase for those guests
when running within KVM.

If only a single guest is needed on a host, we have a lot of spare host
CPU time we can throw at the problem. Modern CPUs implement a feature
called "MWAIT" which allows guests to wake up sleeping remote CPUs without
an IPI - thus without an exit - at the expense of never going out of guest
context.

The decision whether this is something sensible to use should be up to the
VM admin, so to user space. We can however allow MWAIT execution on systems
that support it properly hardware wise.

This patch adds a CAP to user space and a KVM cpuid leaf to indicate
availability of native MWAIT execution. With that enabled, the worst a
guest can do is waste as many cycles as a "jmp ." would do, so it's not
a privilege problem.

We consciously do *not* expose the feature in our CPUID bitmap, as most
people will want to benefit from sleeping vCPUs to allow for over commit.

Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
[agraf: fix amd, change commit message]
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 108b249c 01-Sep-2016 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: introduce get_kvmclock_ns

Introduce a function that reads the exact nanoseconds value that is
provided to the guest in kvmclock. This crystallizes the notion of
kvmclock as a thin veneer over a stable TSC, that the guest will
(hopefully) convert with NTP. In other words, kvmclock is *not* a
paravirtualized host-to-guest NTP.

Drop the get_kernel_ns() function, that was used both to get the base
value of the master clock and to get the current value of kvmclock.
The former use is replaced by ktime_get_boot_ns(), the latter is
the purpose of get_kernel_ns().

This also allows KVM to provide a Hyper-V time reference counter that
is synchronized with the time that is computed from the TSC page.

Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 8d93c874 20-Jun-2016 Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86: move nsec_to_cycles from x86.c to x86.h

Move the inline function nsec_to_cycles from x86.c to x86.h, as
the next patch uses it from lapic.c.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 17a511f8 22-Mar-2016 Huaitong Han <huaitong.han@intel.com>

KVM, pkeys: add pkeys support for xsave state

This patch adds pkeys support for xsave state.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 52004014 25-Jan-2016 Feng Wu <feng.wu@intel.com>

KVM: x86: Use vector-hashing to deliver lowest-priority interrupts

Use vector-hashing to deliver lowest-priority interrupts, As an
example, modern Intel CPUs in server platform use this method to
handle lowest-priority interrupts.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# b51012de 22-Jan-2016 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: introduce do_shl32_div32

This is similar to the existing div_frac function, but it returns the
remainder too. Unlike div_frac, it can be used to implement long
division, e.g. (a << 64) / b for 32-bit a and b.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# d91cab78 02-Sep-2015 Dave Hansen <dave.hansen@linux.intel.com>

x86/fpu: Rename XSAVE macros

There are two concepts that have some confusing naming:
1. Extended State Component numbers (currently called
XFEATURE_BIT_*)
2. Extended State Component masks (currently called XSTATE_*)

The numbers are (currently) from 0-9. State component 3 is the
bounds registers for MPX, for instance.

But when we want to enable "state component 3", we go set a bit
in XCR0. The bit we set is 1<<3. We can check to see if a
state component feature is enabled by looking at its bit.

The current 'xfeature_bit's are at best xfeature bit _numbers_.
Calling them bits is at best inconsistent with ending the enum
list with 'XFEATURES_NR_MAX'.

This patch renames the enum to be 'xfeature'. These also
happen to be what the Intel documentation calls a "state
component".

We also want to differentiate these from the "XSTATE_*" macros.
The "XSTATE_*" macros are a mask, and we rename them to match.

These macros are reasonably widely used so this patch is a
wee bit big, but this really is just a rename.

The only non-mechanical part of this is the

s/XSTATE_EXTEND_MASK/XFEATURE_MASK_EXTEND/

We need a better name for it, but that's another patch.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233126.38653250@viggo.jf.intel.com
[ Ported to v4.3-rc1. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# e83d5887 03-Jul-2015 Andrey Smetanin <asmetanin@virtuozzo.com>

kvm/x86: move Hyper-V MSR's/hypercall code into hyperv.c file

This patch introduce Hyper-V related source code file - hyperv.c and
per vm and per vcpu hyperv context structures.
All Hyper-V MSR's and hypercall code moved into hyperv.c.
All Hyper-V kvm/vcpu fields moved into appropriate hyperv context
structures. Copyrights and authors information copied from x86.c
to hyperv.c.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Peter Hornyack <peterhornyack@google.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 41dbc6bc 23-Jul-2015 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: introduce kvm_check_has_quirk

The logic of the disabled_quirks field usually results in a double
negation. Wrap it in a simple function that checks the bit and
negates it.

Based on a patch from Xiao Guangrong.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 6a39bbc5 15-Jun-2015 Xiao Guangrong <guangrong.xiao@linux.intel.com>

KVM: MTRR: do not map huge page for non-consistent range

Based on Intel's SDM, mapping huge page which do not have consistent
memory cache for each 4k page will cause undefined behavior

In order to avoiding this kind of undefined behavior, we force to use
4k pages under this case

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 19efffa2 15-Jun-2015 Xiao Guangrong <guangrong.xiao@linux.intel.com>

KVM: MTRR: sort variable MTRRs

Sort all valid variable MTRRs based on its base address, it will help us to
check a range to see if it's fully contained in variable MTRRs

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
[Fix list insertion sort, simplify var_mtrr_range_is_valid to just
test the V bit. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# ff53604b 15-Jun-2015 Xiao Guangrong <guangrong.xiao@linux.intel.com>

KVM: x86: move MTRR related code to a separate file

MTRR code locates in x86.c and mmu.c so that move them to a separate file to
make the organization more clearer and it will be the place where we fully
implement vMTRR

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 74545705 27-Apr-2015 Radim Krčmář <rkrcmar@redhat.com>

KVM: x86: fix initial PAT value

PAT should be 0007_0406_0007_0406h on RESET and not modified on INIT.
VMX used a wrong value (host's PAT) and while SVM used the right one,
it never got to arch.pat.

This is not an issue with QEMU as it will force the correct value.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# bab5bb39 01-Jan-2015 Nicholas Krause <xerofoify@gmail.com>

kvm: x86: Remove kvm_make_request from lapic.c

Adds a function kvm_vcpu_set_pending_timer instead of calling
kvm_make_request in lapic.c.

Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# d0659d94 16-Dec-2014 Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86: add option to advance tscdeadline hrtimer expiration

For the hrtimer which emulates the tscdeadline timer in the guest,
add an option to advance expiration, and busy spin on VM-entry waiting
for the actual expiration time to elapse.

This allows achieving low latencies in cyclictest (or any scenario
which requires strict timing regarding timer expiration).

Reduces average cyclictest latency from 12us to 8us
on Core i5 desktop.

Note: this option requires tuning to find the appropriate value
for a particular hardware/guest combination. One method is to measure the
average delay between apic_timer_fn and VM-entry.
Another method is to start with 1000ns, and increase the value
in say 500ns increments until avg cyclictest numbers stop decreasing.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 612263b3 22-Oct-2014 Chao Peng <chao.p.peng@linux.intel.com>

KVM: x86: Enable Intel AVX-512 for guest

Expose Intel AVX-512 feature bits to guest. Also add checks for
xcr0 AVX512 related bits according to spec:
http://download-software.intel.com/sites/default/files/managed/71/2e/319433-017.pdf

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 4566654b 18-Sep-2014 Nadav Amit <namit@cs.technion.ac.il>

KVM: vmx: Inject #GP on invalid PAT CR

Guest which sets the PAT CR to invalid value should get a #GP. Currently, if
vmx supports loading PAT CR during entry, then the value is not checked. This
patch makes the required check in that case.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 56f17dd3 18-Aug-2014 David Matlack <dmatlack@google.com>

kvm: x86: fix stale mmio cache bug

The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
up to userspace:

(1) Guest accesses gpa X without a memory slot. The gfn is cached in
struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
the SPTE write-execute-noread so that future accesses cause
EPT_MISCONFIGs.

(2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
covering the page just accessed.

(3) Guest attempts to read or write to gpa X again. On Intel, this
generates an EPT_MISCONFIG. The memory slot generation number that
was incremented in (2) would normally take care of this but we fast
path mmio faults through quickly_check_mmio_pf(), which only checks
the per-vcpu mmio cache. Since we hit the cache, KVM passes a
KVM_EXIT_MMIO up to userspace.

This patch fixes the issue by using the memslot generation number
to validate the mmio cache.

Cc: stable@vger.kernel.org
Signed-off-by: David Matlack <dmatlack@google.com>
[xiaoguangrong: adjust the code to make it simpler for stable-tree fix.]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Tested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 27e6fb5d 18-Jun-2014 Nadav Amit <namit@cs.technion.ac.il>

KVM: vmx: vmx instructions handling does not consider cs.l

VMX instructions use 32-bit operands in 32-bit mode, and 64-bit operands in
64-bit mode. The current implementation is broken since it does not use the
register operands correctly, and always uses 64-bit for reads and writes.
Moreover, write to memory in vmwrite only considers long-mode, so it ignores
cs.l. This patch fixes this behavior.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 5777392e 18-Jun-2014 Nadav Amit <nadav.amit@gmail.com>

KVM: x86: check DR6/7 high-bits are clear only on long-mode

When the guest sets DR6 and DR7, KVM asserts the high 32-bits are clear, and
otherwise injects a #GP exception. This exception should only be injected only
if running in long-mode.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 4ff41732 23-Feb-2014 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: introduce kvm_supported_xcr0()

XSAVE support for KVM is already using host_xcr0 & KVM_SUPPORTED_XCR0 as
a "dynamic" version of KVM_SUPPORTED_XCR0.

However, this is not enough because the MPX bits should not be presented
to the guest unless kvm_x86_ops confirms the support. So, replace all
instances of host_xcr0 & KVM_SUPPORTED_XCR0 with a new function
kvm_supported_xcr0() that also has this check.

Note that here:

if (xstate_bv & ~KVM_SUPPORTED_XCR0)
return -EINVAL;
if (xstate_bv & ~host_cr0)
return -EINVAL;

the code is equivalent to

if ((xstate_bv & ~KVM_SUPPORTED_XCR0) ||
(xstate_bv & ~host_cr0)
return -EINVAL;

i.e. "xstate_bv & (~KVM_SUPPORTED_XCR0 | ~host_cr0)" which is in turn
equal to "xstate_bv & ~(KVM_SUPPORTED_XCR0 & host_cr0)". So we should
also use the new function there.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 390bd528 24-Feb-2014 Liu, Jinsong <jinsong.liu@intel.com>

KVM: x86: Enable Intel MPX for guest

From 44c2abca2c2eadc6f2f752b66de4acc8131880c4 Mon Sep 17 00:00:00 2001
From: Liu Jinsong <jinsong.liu@intel.com>
Date: Mon, 24 Feb 2014 18:12:31 +0800
Subject: [PATCH v5 3/3] KVM: x86: Enable Intel MPX for guest

This patch enable Intel MPX feature to guest.

Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 9ed96e87 05-Jan-2014 Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86: limit PIT timer frequency

Limit PIT timer frequency similarly to the limit applied by
LAPIC timer.

Cc: stable@kernel.org
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


# 647e23bb 02-Oct-2013 Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: mask unsupported XSAVE entries from leaf 0Dh index 0

XSAVE entries that KVM does not support are reported by
KVM_GET_SUPPORTED_CPUID for leaf 0Dh index 0 if the host supports them;
they should be left out unless there is also hypervisor support for them.

Sub-leafs are correctly handled in supported_xcr0_bit, fix index 0
to match.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>


# 8fe8ab46 29-Nov-2012 Will Auld <will.auld.intel@gmail.com>

KVM: x86: Add code to track call origin for msr assignment

In order to track who initiated the call (host or guest) to modify an msr
value I have changed function call parameters along the call path. The
specific change is to add a struct pointer parameter that points to (index,
data, caller) information rather than having this information passed as
individual parameters.

The initial use for this capability is for updating the IA32_TSC_ADJUST msr
while setting the tsc value. It is anticipated that this capability is
useful for other tasks.

Signed-off-by: Will Auld <will.auld@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 54e9818f 05-Aug-2012 Gleb Natapov <gleb@redhat.com>

KVM: use jump label to optimize checking for in kernel local apic presence

Usually all vcpus have local apic pointer initialized, so the check may
be completely skipped.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# c36fc04e 07-Mar-2012 Davidlohr Bueso <dave@gnu.org>

KVM: x86: add paging gcc optimization

Since most guests will have paging enabled for memory management, add likely() optimization
around CR0.PG checks.

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 00b27a3e 23-Nov-2011 Avi Kivity <avi@redhat.com>

KVM: Move cpuid code to new file

The cpuid code has grown; put it into a separate file.

Signed-off-by: Avi Kivity <avi@redhat.com>


# bebb106a 11-Jul-2011 Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>

KVM: MMU: cache mmio info on page fault path

If the page fault is caused by mmio, we can cache the mmio info, later, we do
not need to walk guest page table and quickly know it is a mmio fault while we
emulate the mmio instruction

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 6a4d7550 25-May-2011 Nadav Har'El <nyh@il.ibm.com>

KVM: nVMX: Implement VMPTRST

This patch implements the VMPTRST instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 064aea77 25-May-2011 Nadav Har'El <nyh@il.ibm.com>

KVM: nVMX: Decoding memory operands of VMX instructions

This patch includes a utility function for decoding pointer operands of VMX
instructions issued by L1 (a guest hypervisor)

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 71f9833b 13-Apr-2011 Serge E. Hallyn <serge@hallyn.com>

KVM: fix push of wrong eip when doing softint

When doing a soft int, we need to bump eip before pushing it to
the stack. Otherwise we'll do the int a second time.

[apw@canonical.com: merged eip update as per Jan's recommendation.]
Signed-off-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 24d1b15f 07-Dec-2010 Joerg Roedel <joerg.roedel@amd.com>

KVM: SVM: Do not report xsave in supported cpuid

To support xsave properly for the guest the SVM module need
software support for it. As long as this is not present do
not report the xsave as supported feature in cpuid.
As a side-effect this patch moves the bit() helper function
into the x86.h file so that it can be used in svm.c too.

KVM-Stable-Tag.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 63995653 19-Sep-2010 Mohammed Gamal <m.gamal005@gmail.com>

KVM: Add kvm_inject_realmode_interrupt() wrapper

This adds a wrapper function kvm_inject_realmode_interrupt() around the
emulator function emulate_int_real() to allow real mode interrupt injection.

[avi: initialize operand and address sizes before emulating interrupts]
[avi: initialize rip for real mode interrupt injection]
[avi: clear interrupt pending flag after emulating interrupt injection]

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 6539e738 10-Sep-2010 Joerg Roedel <joerg.roedel@amd.com>

KVM: MMU: Implement nested gva_to_gpa functions

This patch adds the functions to do a nested l2_gva to
l1_gpa page table walk.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 99e3e30a 20-Aug-2010 Zachary Amsden <zamsden@redhat.com>

KVM: x86: Move TSC offset writes to common code

Also, ensure that the storing of the offset and the reading of the TSC
are never preempted by taking a spinlock. While the lock is overkill
now, it is useful later in this patch series.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# a1f4d395 21-Jun-2010 Avi Kivity <avi@redhat.com>

KVM: Remove memory alias support

As advertised in feature-removal-schedule.txt. Equivalent support is provided
by overlapping memory regions.

Signed-off-by: Avi Kivity <avi@redhat.com>


# 90d83dc3 19-Apr-2010 Lai Jiangshan <laijs@cn.fujitsu.com>

KVM: use the correct RCU API for PROVE_RCU=y

The RCU/SRCU API have already changed for proving RCU usage.

I got the following dmesg when PROVE_RCU=y because we used incorrect API.
This patch coverts rcu_deference() to srcu_dereference() or family API.

===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
arch/x86/kvm/mmu.c:3020 invoked rcu_dereference_check() without protection!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
2 locks held by qemu-system-x86/8550:
#0: (&kvm->slots_lock){+.+.+.}, at: [<ffffffffa011a6ac>] kvm_set_memory_region+0x29/0x50 [kvm]
#1: (&(&kvm->mmu_lock)->rlock){+.+...}, at: [<ffffffffa012262d>] kvm_arch_commit_memory_region+0xa6/0xe2 [kvm]

stack backtrace:
Pid: 8550, comm: qemu-system-x86 Not tainted 2.6.34-rc4-tip-01028-g939eab1 #27
Call Trace:
[<ffffffff8106c59e>] lockdep_rcu_dereference+0xaa/0xb3
[<ffffffffa012f6c1>] kvm_mmu_calculate_mmu_pages+0x44/0x7d [kvm]
[<ffffffffa012263e>] kvm_arch_commit_memory_region+0xb7/0xe2 [kvm]
[<ffffffffa011a5d7>] __kvm_set_memory_region+0x636/0x6e2 [kvm]
[<ffffffffa011a6ba>] kvm_set_memory_region+0x37/0x50 [kvm]
[<ffffffffa015e956>] vmx_set_tss_addr+0x46/0x5a [kvm_intel]
[<ffffffffa0126592>] kvm_arch_vm_ioctl+0x17a/0xcf8 [kvm]
[<ffffffff810a8692>] ? unlock_page+0x27/0x2c
[<ffffffff810bf879>] ? __do_fault+0x3a9/0x3e1
[<ffffffffa011b12f>] kvm_vm_ioctl+0x364/0x38d [kvm]
[<ffffffff81060cfa>] ? up_read+0x23/0x3d
[<ffffffff810f3587>] vfs_ioctl+0x32/0xa6
[<ffffffff810f3b19>] do_vfs_ioctl+0x495/0x4db
[<ffffffff810e6b2f>] ? fget_light+0xc2/0x241
[<ffffffff810e416c>] ? do_sys_open+0x104/0x116
[<ffffffff81382d6d>] ? retint_swapgs+0xe/0x13
[<ffffffff810f3ba6>] sys_ioctl+0x47/0x6a
[<ffffffff810021db>] system_call_fastpath+0x16/0x1b

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# ff9d07a0 18-Apr-2010 Zhang, Yanmin <yanmin_zhang@linux.intel.com>

KVM: Implement perf callbacks for guest sampling

Below patch implements the perf_guest_info_callbacks on kvm.

Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# f6801dff 21-Jan-2010 Avi Kivity <avi@redhat.com>

KVM: Rename vcpu->shadow_efer to efer

None of the other registers have the shadow_ prefix.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 836a1b3c 21-Jan-2010 Avi Kivity <avi@redhat.com>

KVM: Move cr0/cr4/efer related helpers to x86.h

They have more general scope than the mmu.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# 3eeb3288 21-Jan-2010 Avi Kivity <avi@redhat.com>

KVM: Add a helper for checking if the guest is in protected mode

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


# fc61b800 05-Jul-2009 Gleb Natapov <gleb@redhat.com>

KVM: Add Directed EOI support to APIC emulation

Directed EOI is specified by x2APIC, but is available even when lapic is
in xAPIC mode.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 66fd3f7f 11-May-2009 Gleb Natapov <gleb@redhat.com>

KVM: Do not re-execute INTn instruction.

Re-inject event instead. This is what Intel suggest. Also use correct
instruction length when re-injecting soft fault/interrupt.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 923c61bb 11-May-2009 Gleb Natapov <gleb@redhat.com>

KVM: Remove irq_pending bitmap

Only one interrupt vector can be injected from userspace irqchip at
any given time so no need to store it in a bitmap. Put it into interrupt
queue directly.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 3298b75c 11-May-2009 Gleb Natapov <gleb@redhat.com>

KVM: Unprotect a page if #PF happens during NMI injection.

It is done for exception and interrupt already.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 115666df 21-Apr-2009 Gleb Natapov <gleb@redhat.com>

KVM: Remove kvm_push_irq()

No longer used.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# fe4c7b19 23-Mar-2009 Gleb Natapov <gleb@redhat.com>

KVM: reuse (pop|push)_irq from svm.c in vmx.c

The prioritized bit vector manipulation functions are useful in both vmx and
svm.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


# 937a7eae 03-Jul-2008 Avi Kivity <avi@qumranet.com>

KVM: Add a pending interrupt queue

Similar to the exception queue, this hold interrupts that have been
accepted by the virtual processor core but not yet injected.

Not yet used.

Signed-off-by: Avi Kivity <avi@qumranet.com>


# 26eef70c 03-Jul-2008 Avi Kivity <avi@qumranet.com>

KVM: Clear exception queue before emulating an instruction

If we're emulating an instruction, either it will succeed, in which case
any previously queued exception will be spurious, or we will requeue the
same exception.

Signed-off-by: Avi Kivity <avi@qumranet.com>