#
7a36d680 |
|
27-Feb-2024 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: fix recursive deadlock in timer injection

The fast-path timer delivery introduced a recursive locking deadlock when userspace configures a timer which has already expired and is delivered immediately. The call to kvm_xen_inject_timer_irqs() can call kvm_xen_set_evtchn() which may take kvm->arch.xen.xen_lock, which is already held in kvm_xen_vcpu_get_attr().

 ============================================
 WARNING: possible recursive locking detected
 6.8.0-smp--5e10b4d51d77-drs #232 Tainted: G O
 --------------------------------------------
 xen_shinfo_test/250013 is trying to acquire lock:
 ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_set_evtchn+0x74/0x170 [kvm]

 but task is already holding lock:
 ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_vcpu_get_attr+0x38/0x250 [kvm]

Now that the gfn_to_pfn_cache has its own self-sufficient locking, its callers no longer need to ensure serialization, so just stop taking kvm->arch.xen.xen_lock from kvm_xen_set_evtchn().

Fixes: 77c9b9dea4fb ("KVM: x86/xen: Use fast path for Xen timer delivery")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20240227115648.3104-6-dwmw2@infradead.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
#
66e3cf72 |
|
27-Feb-2024 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery The kvm_xen_inject_vcpu_vector() function has a comment saying "the fast version will always work for physical unicast", justifying its use of kvm_irq_delivery_to_apic_fast() and the WARN_ON_ONCE() when that fails. In fact that assumption isn't true if X2APIC isn't in use by the guest and there is (8-bit x)APIC ID aliasing. A single "unicast" destination APIC ID *may* then be delivered to multiple vCPUs. Remove the warning, and in fact it might as well just call kvm_irq_delivery_to_apic(). Reported-by: Michal Luczaj <mhal@rbox.co> Fixes: fde0451be8fb3 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240227115648.3104-4-dwmw2@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com>
|
#
8e62bf2b |
|
27-Feb-2024 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled Linux guests since commit b1c3497e604d ("x86/xen: Add support for HVMOP_set_evtchn_upcall_vector") in v6.0 onwards will use the per-vCPU upcall vector when it's advertised in the Xen CPUID leaves. This upcall is injected through the guest's local APIC as an MSI, unlike the older system vector which was merely injected by the hypervisor any time the CPU was able to receive an interrupt and the upcall_pending flag is set in its vcpu_info. Effectively, that makes the per-CPU upcall edge triggered instead of level triggered, which results in the upcall being lost if the MSI is delivered when the local APIC is *disabled*. Xen checks the vcpu_info->evtchn_upcall_pending flag when the local APIC for a vCPU is software enabled (in fact, on any write to the SPIV register which doesn't disable the APIC). Do the same in KVM since KVM doesn't provide a way for userspace to intervene and trap accesses to the SPIV register of a local APIC emulated by KVM. Fixes: fde0451be8fb3 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240227115648.3104-3-dwmw2@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com>
|
#
451a7078 |
|
27-Feb-2024 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: improve accuracy of Xen timers

A test program such as http://david.woodhou.se/timerlat.c confirms user reports that timers are increasingly inaccurate as the lifetime of a guest increases. Reporting the actual delay observed when asking for 100µs of sleep, it starts off OK on a newly-launched guest but gets worse over time, giving incorrect sleep times:

 root@ip-10-0-193-21:~# ./timerlat -c -n 5
 00000000 latency 103243/100000 (3.2430%)
 00000001 latency 103243/100000 (3.2430%)
 00000002 latency 103242/100000 (3.2420%)
 00000003 latency 103245/100000 (3.2450%)
 00000004 latency 103245/100000 (3.2450%)

The biggest problem is that get_kvmclock_ns() returns inaccurate values when the guest TSC is scaled. The guest sees a TSC value scaled from the host TSC by a mul/shift conversion (hopefully done in hardware). The guest then converts that guest TSC value into nanoseconds using the mul/shift conversion given to it by the KVM pvclock information. But get_kvmclock_ns() performs only a single conversion directly from host TSC to nanoseconds, giving a different result. A test program at http://david.woodhou.se/tsdrift.c demonstrates the cumulative error over a day.

It's non-trivial to fix get_kvmclock_ns(), although I'll come back to that. The actual guest hv_clock is per-CPU, and *theoretically* each vCPU could be running at a *different* frequency. But this patch is needed anyway because...

The other issue with Xen timers was that the code would snapshot the host CLOCK_MONOTONIC at some point in time, and then... after a few interrupts may have occurred, some preemption perhaps... would also read the guest's kvmclock. Then it would proceed under the false assumption that those two happened at the *same* time. Any time which *actually* elapsed between reading the two clocks was introduced as inaccuracies in the time at which the timer fired.

Fix it to use a variant of kvm_get_time_and_clockread(), which reads the host TSC just *once*, then use the returned TSC value to calculate the kvmclock (making sure to do that the way the guest would instead of making the same mistake get_kvmclock_ns() does). Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen timers still have to use CLOCK_MONOTONIC. In practice the difference between the two won't matter over the timescales involved, as the *absolute* values don't matter; just the delta. This does mean a new variant of kvm_get_time_and_clockread() is needed; called kvm_get_monotonic_and_clockread() because that's what it does.

Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20240227115648.3104-2-dwmw2@infradead.org
[sean: massage moved comment, tweak if statement formatting]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
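The mul/shift double-conversion issue described above can be reproduced outside the kernel. Below is a toy, self-contained C sketch (hypothetical frequencies, shift fixed at 0, not code from the patch) showing how converting host TSC to guest TSC to nanoseconds in two truncating fixed-point steps drifts away from a single direct host-TSC-to-nanoseconds conversion; pvclock_scale() mirrors the pvclock ABI's 32.32 fixed-point scaling:

    #include <stdio.h>
    #include <stdint.h>

    /* Mirrors pvclock-style 32.32 fixed-point scaling of a TSC delta. */
    static uint64_t pvclock_scale(uint64_t delta, uint32_t mul_frac, int shift)
    {
        if (shift < 0)
            delta >>= -shift;
        else
            delta <<= shift;
        return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
    }

    int main(void)
    {
        /* Hypothetical: 2.6 GHz host, guest TSC scaled to 2.0 GHz. */
        uint64_t host_hz = 2600000000ULL, guest_hz = 2000000000ULL;
        uint32_t tsc_mul    = (uint32_t)((guest_hz << 32) / host_hz);       /* host TSC -> guest TSC */
        uint32_t ns_mul     = (uint32_t)((1000000000ULL << 32) / guest_hz); /* guest TSC -> ns */
        uint32_t direct_mul = (uint32_t)((1000000000ULL << 32) / host_hz);  /* host TSC -> ns */

        uint64_t host_tsc = 86400ULL * host_hz;  /* one day's worth of host cycles */
        uint64_t chained  = pvclock_scale(pvclock_scale(host_tsc, tsc_mul, 0), ns_mul, 0);
        uint64_t direct   = pvclock_scale(host_tsc, direct_mul, 0);

        /* The two results disagree, and the gap grows with guest lifetime. */
        printf("guest-visible: %llu ns, direct: %llu ns, drift: %lld ns\n",
               (unsigned long long)chained, (unsigned long long)direct,
               (long long)(chained - direct));
        return 0;
    }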
#
003d9142 |
|
15-Feb-2024 |
Paul Durrant <pdurrant@amazon.com> |
KVM: x86/xen: allow vcpu_info content to be 'safely' copied If the guest sets an explicit vcpu_info GPA then, for any of the first 32 vCPUs, the content of the default vcpu_info in the shared_info page must be copied into the new location. Because this copy may race with event delivery (which updates the 'evtchn_pending_sel' field in vcpu_info), event delivery needs to be deferred until the copy is complete. Happily there is already a shadow of 'evtchn_pending_sel' in kvm_vcpu_xen that is used in atomic context if the vcpu_info PFN cache has been invalidated so that the update of vcpu_info can be deferred until the cache can be refreshed (on the vCPU thread's way back into guest context). Use this shadow if the vcpu_info cache has been *deactivated*, so that the VMM can safely copy the vcpu_info content and then re-activate the cache with the new GPA. To do this, stop considering an inactive vcpu_info cache as a hard error in kvm_xen_set_evtchn_fast(), and let the existing kvm_gpc_check() fail and kick the vCPU (if necessary). Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-21-paul@xen.org [sean: add a bit of verbosity to the changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
|
#
3991f358 |
|
15-Feb-2024 |
Paul Durrant <pdurrant@amazon.com> |
KVM: x86/xen: allow vcpu_info to be mapped by fixed HVA If the guest does not explicitly set the GPA of the vcpu_info structure in memory then, for guests with 32 vCPUs or fewer, the vcpu_info embedded in the shared_info page may be used. As described in a previous commit, the shared_info page is an overlay at a fixed HVA within the VMM, so in this case it is also more optimal to activate the vcpu_info cache with a fixed HVA to avoid unnecessary invalidation if the guest memory layout is modified. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-14-paul@xen.org [sean: use kvm_gpc_is_{gpa,hva}_active()] Signed-off-by: Sean Christopherson <seanjc@google.com>
|
#
b9220d32 |
|
15-Feb-2024 |
Paul Durrant <pdurrant@amazon.com> |
KVM: x86/xen: allow shared_info to be mapped by fixed HVA The shared_info page is not guest memory as such. It is a dedicated page allocated by the VMM and overlaid onto guest memory in a GFN chosen by the guest and specified in the XENMEM_add_to_physmap hypercall. The guest may even request that shared_info be moved from one GFN to another by re-issuing that hypercall, but the HVA is never going to change. Because the shared_info page is an overlay the memory slots need to be updated in response to the hypercall. However, memory slot adjustment is not atomic and, whilst all vCPUs are paused, there is still the possibility that events may be delivered (which requires the shared_info page to be updated) whilst the shared_info GPA is absent. The HVA is never absent though, so it makes much more sense to use that as the basis for the kernel's mapping. Hence add a new KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA attribute type for this purpose and a KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag to advertise its availability. Don't actually advertise it yet though. That will be done in a subsequent patch, which will also add tests for the new attribute type. Also update the KVM API documentation with the new attribute and fix it up to consistently refer to 'shared_info' (with the underscore). Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-13-paul@xen.org [sean: store "hva" as a user pointer, use kvm_gpc_is_{gpa,hva}_active()] Signed-off-by: Sean Christopherson <seanjc@google.com>
|
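As a hypothetical illustration of how a VMM might use the new attribute (the exact union layout inside struct kvm_xen_hvm_attr is an assumption here, not quoted from the patch; assumes <linux/kvm.h>, <sys/ioctl.h>, <err.h>, <stdint.h>):

    /* shinfo is the VMM's dedicated overlay page; its HVA never changes,
     * even when the guest moves the overlay to a different GFN. */
    struct kvm_xen_hvm_attr ha = {
        .type = KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA,
        .u.shared_info.hva = (uintptr_t)shinfo,   /* field path assumed */
    };

    if (ioctl(vm_fd, KVM_XEN_HVM_SET_ATTR, &ha))
        err(1, "KVM_XEN_HVM_SET_ATTR");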
#
18b99e4d |
|
15-Feb-2024 |
Paul Durrant <pdurrant@amazon.com> |
KVM: x86/xen: re-initialize shared_info if guest (32/64-bit) mode is set If the shared_info PFN cache has already been initialized then the content of the shared_info page needs to be re-initialized whenever the guest mode is (re)set. Setting the guest mode is either done explicitly by the VMM via the KVM_XEN_ATTR_TYPE_LONG_MODE attribute, or implicitly when the guest writes the MSR to set up the hypercall page. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-12-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
|
#
c01c55a3 |
|
15-Feb-2024 |
Paul Durrant <pdurrant@amazon.com> |
KVM: x86/xen: separate initialization of shared_info cache and content A subsequent patch will allow shared_info to be initialized using either a GPA or a user-space (i.e. VMM) HVA. To make that patch cleaner, separate the initialization of the shared_info content from the activation of the pfncache. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-11-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
|
#
a4bff3df |
|
15-Feb-2024 |
Paul Durrant <pdurrant@amazon.com> |
KVM: pfncache: remove KVM_GUEST_USES_PFN usage As noted in [1] the KVM_GUEST_USES_PFN usage flag is never set by any callers of kvm_gpc_init(), and for good reason: the implementation is incomplete/broken. And it's not clear that there will ever be a user of KVM_GUEST_USES_PFN, as coordinating vCPUs with mmu_notifier events is non-trivial. Remove KVM_GUEST_USES_PFN and all related code, e.g. dropping KVM_GUEST_USES_PFN also makes the 'vcpu' argument redundant, to avoid having to reason about broken code as __kvm_gpc_refresh() evolves. Moreover, all existing callers specify KVM_HOST_USES_PFN so the usage check in hva_to_pfn_retry() and hence the 'usage' argument to kvm_gpc_init() are also redundant. [1] https://lore.kernel.org/all/ZQiR8IpqOZrOpzHC@google.com Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-6-paul@xen.org [sean: explicitly call out that guest usage is incomplete] Signed-off-by: Sean Christopherson <seanjc@google.com>
|
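In terms of call-site shape, the change described above looks roughly like this (prototypes paraphrased from the changelog rather than copied from the diff):

    /* Before: a vCPU and a usage flag had to be supplied, even though
     * every caller passed KVM_HOST_USES_PFN and the vcpu was unused
     * for host-only caches: */
    kvm_gpc_init(gpc, kvm, NULL, KVM_HOST_USES_PFN);

    /* After: both redundant parameters are gone: */
    kvm_gpc_init(gpc, kvm);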
#
78b74638 |
|
15-Feb-2024 |
Paul Durrant <pdurrant@amazon.com> |
KVM: pfncache: add a mark-dirty helper At the moment pages are marked dirty by open-coded calls to mark_page_dirty_in_slot(), directly dereferencing the gpa and memslot from the cache. After a subsequent patch these may not always be set, so add a helper now so that callers are protected from needing to know about this detail. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-5-paul@xen.org [sean: decrease indentation, use gpa_to_gfn()] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
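A minimal sketch of what such a helper could look like, assuming the shape described in the changelog (the real name and body may differ):

    static void kvm_gpc_mark_dirty(struct gfn_to_pfn_cache *gpc)
    {
        lockdep_assert_held(&gpc->lock);

        /* After the follow-up patch, HVA-only caches have no memslot/gpa. */
        if (gpc->memslot)
            mark_page_dirty_in_slot(gpc->kvm, gpc->memslot,
                                    gpa_to_gfn(gpc->gpa));
    }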
#
4438355e |
|
15-Feb-2024 |
Paul Durrant <pdurrant@amazon.com> |
KVM: x86/xen: mark guest pages dirty with the pfncache lock held Sampling gpa and memslot from an unlocked pfncache may yield inconsistent values so, since there is no problem with calling mark_page_dirty_in_slot() with the pfncache lock held, relocate the calls in kvm_xen_update_runstate_guest() and kvm_xen_inject_pending_events() accordingly. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-4-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
|
#
6d722835 |
|
02-Nov-2023 |
Paul Durrant <pdurrant@amazon.com> |
KVM: x86/xen: add an override for PVCLOCK_TSC_STABLE_BIT Unless explicitly told to do so (by passing 'clocksource=tsc' and 'tsc=stable:socket', and then jumping through some hoops concerning potential CPU hotplug) Xen will never use TSC as its clocksource. Hence, by default, a Xen guest will not see PVCLOCK_TSC_STABLE_BIT set in either the primary or secondary pvclock memory areas. This has led to bugs in some guest kernels which only become evident if PVCLOCK_TSC_STABLE_BIT *is* set in the pvclocks. Hence, to support such guests, give the VMM a new Xen HVM config flag to tell KVM to forcibly clear the bit in the Xen pvclocks. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20231102162128.2353459-1-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
|
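Conceptually the override is a one-line mask applied wherever KVM writes the Xen pvclock flags; a sketch (the flag name follows the changelog's description and the field paths are assumptions):

    /* The VMM opted in via KVM_XEN_HVM_CONFIG, so hide TSC-stable
     * from the guest even when the host kvmclock is stable. */
    if (kvm->arch.xen_hvm_config.flags & KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE)
        pvclock_flags &= ~PVCLOCK_TSC_STABLE_BIT;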
#
3652117f |
|
22-Nov-2023 |
Christian Brauner <brauner@kernel.org> |
eventfd: simplify eventfd_signal() Ever since the eventfd type was introduced back in 2007 in commit e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal() function only ever passed 1 as a value for @n. There's no point in keeping that additional argument. Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org Acked-by: Xu Yilun <yilun.xu@intel.com> Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl Acked-by: Eric Farman <farman@linux.ibm.com> # s390 Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
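The resulting API change at call sites is mechanical; for example:

    struct eventfd_ctx *ctx = evtchnfd_ctx;   /* whatever the caller holds */

    eventfd_signal(ctx, 1);   /* before: explicit count, always 1 */
    eventfd_signal(ctx);      /* after: the increment of 1 is implied */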
#
5d6d6a7d |
|
05-Oct-2023 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86: Refine calculation of guest wall clock to use a single TSC read When populating the guest's PV wall clock information, KVM currently does a simple 'kvm_get_real_ns() - get_kvmclock_ns(kvm)'. This is an antipattern which should be avoided; when working with the relationship between two clocks, it's never correct to obtain one of them "now" and then the other at a slightly different "now" after an unspecified period of preemption (which might not even be under the control of the kernel, if this is an L1 hosting an L2 guest under nested virtualization). Add a kvm_get_wall_clock_epoch() function to return the guest wall clock epoch in nanoseconds using the same method as __get_kvmclock() — by using kvm_get_walltime_and_clockread() to calculate both the wall clock and KVM clock time from a *single* TSC reading. The condition using get_cpu_tsc_khz() is equivalent to the version in __get_kvmclock() which separately checks for the CONSTANT_TSC feature or the per-CPU cpu_tsc_khz. Which is what get_cpu_tsc_khz() does anyway. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/bfc6d3d7cfb88c47481eabbf5a30a264c58c7789.camel@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com>
|
#
409f2e92 |
|
04-Oct-2023 |
Paul Durrant <pdurrant@amazon.com> |
KVM: x86/xen: ignore the VCPU_SSHOTTMR_future flag Upstream Xen now ignores _VCPU_SSHOTTMR_future[1], since the only guest kernel ever to use it was buggy. By ignoring the flag, the guest will always get a callback if it sets a negative timeout, which upstream Xen has determined does not cause problems for any guest setting the flag. [1] https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=19c6cbd909 Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20231004174628.2073263-1-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
|
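A sketch of the kind of check being removed, under the semantics described above (variable names are assumptions, not the literal diff):

    /* Removed: with VCPU_SSHOTTMR_future set, a deadline already in
     * the past used to fail with -ETIME instead of firing immediately. */
    if ((oneshot.flags & VCPU_SSHOTTMR_future) &&
        oneshot.timeout_abs_ns < get_kvmclock_ns(vcpu->kvm))
        return -ETIME;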
#
77c9b9de |
|
30-Sep-2023 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Use fast path for Xen timer delivery Most of the time there's no need to kick the vCPU and deliver the timer event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast() directly from the timer callback, and only fall back to the slow path if delivering the timer would block, i.e. if kvm_xen_set_evtchn_fast() returns -EWOULDBLOCK. If delivery fails for any other reason, do nothing and just let it fail silently, as that is what the slow path would end up doing anyways. This gives a significant improvement in timer latency testing (using nanosleep() for various periods and then measuring the actual time elapsed). However, there was a reason[1] the fast path was dropped when this support was first added. The current code holds vcpu->mutex for all operations on the kvm->arch.timer_expires field, and the fast path introduces a potential race condition. Avoid that race by ensuring the hrtimer is (temporarily) cancelled before making changes in kvm_xen_start_timer(), and also when reading the values out for KVM_XEN_VCPU_ATTR_TYPE_TIMER. [1] https://lore.kernel.org/kvm/846caa99-2e42-4443-1070-84e49d2f11d2@redhat.com Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/f21ee3bd852761e7808240d4ecaec3013c649dc7.camel@infradead.org [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
|
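The shape of the fast path described above, as a hedged sketch rather than the literal diff (struct field names are assumed):

    static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
    {
        struct kvm_vcpu *vcpu = container_of(timer, struct kvm_vcpu,
                                             arch.xen.timer);
        struct kvm_xen_evtchn e = {
            .vcpu_id  = vcpu->vcpu_id,
            .vcpu_idx = vcpu->vcpu_idx,
            .port     = vcpu->arch.xen.timer_virq,
            .priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL,
        };

        /* Deliver directly; any result other than -EWOULDBLOCK is final. */
        if (kvm_xen_set_evtchn_fast(&e, vcpu->kvm) != -EWOULDBLOCK)
            return HRTIMER_NORESTART;

        /* Slow path: mark the timer pending and kick the vCPU. */
        atomic_inc(&vcpu->arch.xen.timer_pending);
        kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
        kvm_vcpu_kick(vcpu);
        return HRTIMER_NORESTART;
    }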
#
f422f853 |
|
06-Jan-2023 |
Paul Durrant <pdurrant@amazon.com> |
KVM: x86/xen: update Xen CPUID Leaf 4 (tsc info) sub-leaves, if present The scaling information in subleaf 1 should match the values set by KVM in the 'vcpu_info' sub-structure 'time_info' (a.k.a. pvclock_vcpu_time_info) which is shared with the guest, but is not directly available to the VMM. The offset values are not set since a TSC offset is already applied. The TSC frequency should also be set in sub-leaf 2. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20230106103600.528-3-pdurrant@amazon.com Signed-off-by: Sean Christopherson <seanjc@google.com>
|
#
8d20bd63 |
|
30-Nov-2022 |
Sean Christopherson <seanjc@google.com> |
KVM: x86: Unify pr_fmt to use module name for all KVM modules Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that printks use consistent formatting across common x86, Intel, and AMD code. In addition to providing consistent print formatting, using KBUILD_MODNAME, e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV and SGX and ...) as technologies without generating weird messages, and without causing naming conflicts with other kernel code, e.g. "SEV: ", "tdx: ", "sgx: " etc.. are all used by the kernel for non-KVM subsystems. Opportunistically move away from printk() for prints that need to be modified anyways, e.g. to drop a manual "kvm: " prefix. Opportunistically convert a few SGX WARNs that are similarly modified to WARN_ONCE; in the very unlikely event that the WARNs fire, odds are good that they would fire repeatedly and spam the kernel log without providing unique information in each print. Note, defining pr_fmt yields undesirable results for code that uses KVM's printk wrappers, e.g. vcpu_unimpl(). But, that's a pre-existing problem as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's wrappers is relatively limited in KVM x86 code. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paul Durrant <paul@xen.org> Message-Id: <20221130230934.1014142-35-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
310bc395 |
|
11-Jan-2023 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Avoid deadlock by adding kvm->arch.xen.xen_lock leaf node lock

In commit 14243b387137a ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery") the clever version of me left some helpful notes for those who would come after him:

 /*
  * For the irqfd workqueue, using the main kvm->lock mutex is
  * fine since this function is invoked from kvm_set_irq() with
  * no other lock held, no srcu. In future if it will be called
  * directly from a vCPU thread (e.g. on hypercall for an IPI)
  * then it may need to switch to using a leaf-node mutex for
  * serializing the shared_info mapping.
  */
 mutex_lock(&kvm->lock);

In commit 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests") the other version of me ran straight past that comment without reading it, and introduced a potential deadlock by taking vcpu->mutex and kvm->lock in the wrong order. Solve this as originally suggested, by adding a leaf-node lock in the Xen state rather than using kvm->lock for it.

Fixes: 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20230111180651.14394-4-dwmw2@infradead.org>
[Rebase, add docs. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
bbe17c62 |
|
11-Jan-2023 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Fix potential deadlock in kvm_xen_update_runstate_guest()

The kvm_xen_update_runstate_guest() function can be called when the vCPU is being scheduled out, from a preempt notifier. It *opportunistically* updates the runstate area in the guest memory, if the gfn_to_pfn_cache which caches the appropriate address is still valid. If there is *contention* when it attempts to obtain gpc->lock, then locking inside the priority inheritance checks may cause a deadlock. Lockdep reports:

 [13890.148997] Chain exists of:
   &gpc->lock --> &p->pi_lock --> &rq->__lock

 [13890.149002]  Possible unsafe locking scenario:

 [13890.149003]        CPU0                    CPU1
 [13890.149004]        ----                    ----
 [13890.149005]   lock(&rq->__lock);
 [13890.149007]                            lock(&p->pi_lock);
 [13890.149009]                            lock(&rq->__lock);
 [13890.149011]   lock(&gpc->lock);
 [13890.149013]
                 *** DEADLOCK ***

In the general case, if there's contention for a read lock on gpc->lock, that's going to be because something else is either invalidating or revalidating the cache. Either way, we've raced with seeing it in an invalid state, in which case we would have aborted the opportunistic update anyway. So in the 'atomic' case when called from the preempt notifier, just switch to using read_trylock() and avoid the PI handling altogether.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20230111180651.14394-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
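The pattern that results, sketched (with 'atomic' meaning "called from the preempt notifier"):

    if (atomic) {
        /* Opportunistic only: contention implies the cache is being
         * invalidated or revalidated, so the update would have been
         * aborted anyway. Skip the PI machinery entirely. */
        if (!read_trylock(&gpc->lock))
            return;
    } else {
        read_lock_irqsave(&gpc->lock, flags);
    }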
#
23e60258 |
|
11-Jan-2023 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Fix lockdep warning on "recursive" gpc locking

In commit 5ec3289b31 ("KVM: x86/xen: Compatibility fixes for shared runstate area") we declared it safe to obtain two gfn_to_pfn_cache locks at the same time:

 /*
  * The guest's runstate_info is split across two pages and we
  * need to hold and validate both GPCs simultaneously. We can
  * declare a lock ordering GPC1 > GPC2 because nothing else
  * takes them more than one at a time.
  */

However, we forgot to tell lockdep. Do so, by setting a subclass on the first lock before taking the second.

Fixes: 5ec3289b31 ("KVM: x86/xen: Compatibility fixes for shared runstate area")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20230111180651.14394-1-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
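A sketch of the annotation, given the GPC1 > GPC2 ordering rule the earlier commit declared (this mirrors the description above; the exact calls in the diff may differ):

    read_lock_irqsave(&gpc1->lock, flags);
    /* Tell lockdep these are two distinct gfn_to_pfn_cache locks, so
     * taking the second is not self-recursion on the first. */
    lock_set_subclass(&gpc1->lock.dep_map, 1, _THIS_IP_);
    read_lock(&gpc2->lock);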
#
a79b53aa |
|
28-Dec-2022 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: x86: fix deadlock for KVM_XEN_EVTCHN_RESET While KVM_XEN_EVTCHN_RESET is usually called with no vCPUs running, if that happened it could cause a deadlock. This is due to kvm_xen_eventfd_reset() doing a synchronize_srcu() inside a kvm->lock critical section. To avoid this, first collect all the evtchnfd objects in an array and free all of them once the kvm->lock critical section is over and the SRCU grace period has expired. Reported-by: Michal Luczaj <mhal@rbox.co> Cc: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
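The fix's shape, sketched from the changelog (names assumed): detach everything under the lock, wait for readers with no lock held, then free.

    mutex_lock(&kvm->lock);
    /* ... idr_remove() each evtchnfd, stashing the pointers in a
     * kmalloc_array()'d local list ... */
    mutex_unlock(&kvm->lock);

    synchronize_srcu(&kvm->srcu);   /* safe: kvm->lock is no longer held */

    /* ... now free every collected evtchnfd and the array itself. */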
#
b0305c1e |
|
25-Dec-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Add KVM_XEN_INVALID_GPA and KVM_XEN_INVALID_GFN to uapi These (uint64_t)-1 magic values are a userspace ABI, allowing the shared info pages and other enlightenments to be disabled. This isn't a Xen ABI because Xen doesn't let the guest turn these off except with the full SHUTDOWN_soft_reset mechanism. Under KVM, the userspace VMM is expected to handle soft reset, and tear down the kernel parts of the enlightenments accordingly. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221226120320.1125390-5-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
1c14faa5 |
|
25-Dec-2022 |
Michal Luczaj <mhal@rbox.co> |
KVM: x86/xen: Simplify eventfd IOCTLs Port number is validated in kvm_xen_setattr_evtchn(). Remove superfluous checks in kvm_xen_eventfd_assign() and kvm_xen_eventfd_update(). Signed-off-by: Michal Luczaj <mhal@rbox.co> Message-Id: <20221222203021.1944101-3-mhal@rbox.co> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221226120320.1125390-4-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
70eae030 |
|
25-Dec-2022 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: x86/xen: Fix SRCU/RCU usage in readers of evtchn_ports The evtchnfd structure itself must be protected by either kvm->lock or SRCU. Use the former in kvm_xen_eventfd_update(), since the lock is being taken anyway; kvm_xen_hcall_evtchn_send() instead is a reader and does not need kvm->lock, and is called in SRCU critical section from the kvm_x86_handle_exit function. It is also important to use rcu_read_{lock,unlock}() in kvm_xen_hcall_evtchn_send(), because idr_remove() will *not* use synchronize_srcu() to wait for readers to complete. Remove a superfluous if (kvm) check before calling synchronize_srcu() in kvm_xen_eventfd_deassign() where kvm has been dereferenced already. Co-developed-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221226120320.1125390-3-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
92c58965 |
|
25-Dec-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Use kvm_read_guest_virt() instead of open-coding it badly

In particular, we shouldn't assume that being contiguous in guest virtual address space means being contiguous in guest *physical* address space. In dropping the manual calls to kvm_mmu_gva_to_gpa_system(), also drop the srcu_read_lock() that was around them. All call sites are reached from kvm_xen_hypercall() which is called from the handle_exit function with the read lock already held.

Fixes: 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests")
Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode")
Fixes: 1a65105a5aba ("KVM: x86/xen: handle PV spinlocks slowpath")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
385407a6 |
|
25-Dec-2022 |
Michal Luczaj <mhal@rbox.co> |
KVM: x86/xen: Fix memory leak in kvm_xen_write_hypercall_page() Release the page irrespective of the kvm_vcpu_write_guest() return value. Suggested-by: Paul Durrant <paul@xen.org> Fixes: 23200b7a30de ("KVM: x86/xen: intercept xen hypercalls if enabled") Signed-off-by: Michal Luczaj <mhal@rbox.co> Message-Id: <20221220151454.712165-1-mhal@rbox.co> Reviewed-by: Paul Durrant <paul@xen.org> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221226120320.1125390-1-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
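The resulting control flow, roughly (helper and variable names here are assumptions; the key point is the unconditional free):

    u8 *page = memdup_user((u8 __user *)blob_addr, PAGE_SIZE);

    if (IS_ERR(page))
        return PTR_ERR(page);

    ret = kvm_vcpu_write_guest(vcpu, page_addr, page, PAGE_SIZE);
    kfree(page);        /* freed whether or not the write succeeded */
    return ret;         /* any write error still propagates */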
#
4265df66 |
|
07-Nov-2022 |
Peng Hao <flyingpeng@tencent.com> |
KVM: x86: Keep the lock order consistent between SRCU and gpc spinlock Acquire SRCU before taking the gpc spinlock in wait_pending_event() so as to be consistent with all other functions that acquire both locks. It's not illegal to acquire SRCU inside a spinlock, nor is there deadlock potential, but in general it's preferable to order locks from least restrictive to most restrictive, e.g. if wait_pending_event() needed to sleep for whatever reason, it could do so while holding SRCU, but would need to drop the spinlock. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/CAPm50a++Cb=QfnjMZ2EnCj-Sb9Y4UM-=uOEtHAcjnNLCAAf-dQ@mail.gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
|
#
58f5ee5f |
|
13-Oct-2022 |
Sean Christopherson <seanjc@google.com> |
KVM: Drop @gpa from exported gfn=>pfn cache check() and refresh() helpers Drop the @gpa param from the exported check()+refresh() helpers and limit changing the cache's GPA to the activate path. All external users just feed in gpc->gpa, i.e. this is a fancy nop. Allowing users to change the GPA at check()+refresh() is dangerous as those helpers explicitly allow concurrent calls, e.g. KVM could get into a livelock scenario. It's also unclear as to what the expected behavior should be if multiple tasks attempt to refresh with different GPAs. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
#
0318f207 |
|
13-Oct-2022 |
Michal Luczaj <mhal@rbox.co> |
KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_refresh() Make kvm_gpc_refresh() use the kvm instance cached in the gfn_to_pfn_cache. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> [sean: leave kvm_gpc_unmap() as-is] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
#
e308c24a |
|
13-Oct-2022 |
Michal Luczaj <mhal@rbox.co> |
KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_check() Make kvm_gpc_check() use the kvm instance cached in the gfn_to_pfn_cache. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
#
8c82a0b3 |
|
13-Oct-2022 |
Michal Luczaj <mhal@rbox.co> |
KVM: Store immutable gfn_to_pfn_cache properties Move the assignment of immutable properties @kvm, @vcpu, and @usage to the initializer. Make _activate() and _deactivate() use stored values. Note, @len is also effectively immutable for most cases, but not in the case of the Xen runstate cache, which may be split across two pages and the length of the first segment will depend on its address. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> [sean: handle @len in a separate patch] Signed-off-by: Sean Christopherson <seanjc@google.com> [dwmw2: acknowledge that @len can actually change for some use cases] Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
#
214b0a88 |
|
21-Mar-2022 |
Metin Kaya <metikaya@amazon.com> |
KVM: x86/xen: add support for 32-bit guests in SCHEDOP_poll This patch introduces a compat version of struct sched_poll for the SCHEDOP_poll sub-operation of the sched_op hypercall, reads the correct amount of data (16 bytes in the 32-bit case, 24 bytes otherwise) by using the new compat_sched_poll struct, copies it to sched_poll properly, and lets the rest of the code run as is. Signed-off-by: Metin Kaya <metikaya@amazon.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
|
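The 16-byte vs 24-byte difference comes purely from guest pointer width and alignment; a standalone re-creation (field names follow Xen's struct sched_poll, sizes assume an x86-64 build):

    #include <stdint.h>

    struct compat_sched_poll {      /* 32-bit guest: 16 bytes */
        uint32_t ports;             /* guest pointer is 32 bits wide */
        uint32_t nr_ports;
        uint64_t timeout;
    };

    struct sched_poll_64 {          /* 64-bit guest: 24 bytes */
        uint64_t ports;             /* guest pointer is 64 bits wide */
        uint32_t nr_ports;          /* followed by 4 bytes of padding */
        uint64_t timeout;
    };

    _Static_assert(sizeof(struct compat_sched_poll) == 16, "compat ABI");
    _Static_assert(sizeof(struct sched_poll_64) == 24, "native ABI");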
#
aba3caef |
|
13-Oct-2022 |
Michal Luczaj <mhal@rbox.co> |
KVM: Shorten gfn_to_pfn_cache function names Formalize "gpc" as the acronym and use it in function names. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
8acc3518 |
|
19-Nov-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary Torture test the cases where the runstate crosses a page boundary, and especially the case where it's configured in 32-bit mode and doesn't, but then switching to 64-bit mode makes it go onto the second page. To simplify this, make the KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST ioctl also update the guest runstate area. It already did so if the actual runstate changed, as a side-effect of kvm_xen_update_runstate(). So doing it in the plain adjustment case is making it more consistent, as well as giving us a nice way to trigger the update without actually running the vCPU again and changing the values. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
d8ba8ba4 |
|
26-Nov-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured Closer inspection of the Xen code shows that we aren't supposed to be using the XEN_RUNSTATE_UPDATE flag unconditionally. It should be explicitly enabled by guests through the HYPERVISOR_vm_assist hypercall. If we randomly set the top bit of ->state_entry_time for a guest that hasn't asked for it and doesn't expect it, that could make the runtimes fail to add up and confuse the guest. Without the flag it's perfectly safe for a vCPU to read its own vcpu_runstate_info; just not for one vCPU to read *another's*. I briefly pondered adding a word for the whole set of VMASST_TYPE_* flags but the only one we care about for HVM guests is this, so it seemed a bit pointless. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221127122210.248427-3-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
5ec3289b |
|
18-Nov-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Compatibility fixes for shared runstate area The guest runstate area can be arbitrarily byte-aligned. In fact, even when a sane 32-bit guest aligns the overall structure nicely, the 64-bit fields in the structure end up being unaligned due to the fact that the 32-bit ABI only aligns them to 32 bits. So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE is buggy, because if it's unaligned then we can't update the whole field atomically; the low bytes might be observable before the _UPDATE bit is. Xen actually updates the *byte* containing that top bit, on its own. KVM should do the same. In addition, we cannot assume that the runstate area fits within a single page. One option might be to make the gfn_to_pfn cache cope with regions that cross a page — but getting a contiguous virtual kernel mapping of a discontiguous set of IOMEM pages is a distinctly non-trivial exercise, and it seems this is the *only* current use case for the GPC which would benefit from it. An earlier version of the runstate code did use a gfn_to_hva cache for this purpose, but it still had the single-page restriction because it used the uhva directly — because it needs to be able to do so atomically when the vCPU is being scheduled out, so it used pagefault_disable() around the accesses and didn't just use kvm_write_guest_cached() which has a fallback path. So... use a pair of GPCs for the first and potential second page covering the runstate area. We can get away with locking both at once because nothing else takes more than one GPC lock at a time so we can invent a trivial ordering rule. The common case where it's all in the same page is kept as a fast path, but in both cases, the actual guest structure (compat or not) is built up from the fields in @vx, following preset pointers to the state and times fields. The only difference is whether those pointers point to the kernel stack (in the split case) or to guest memory directly via the GPC. The fast path is also fixed to use a byte access for the XEN_RUNSTATE_UPDATE bit, then the only real difference is the dual memcpy. Finally, Xen also does write the runstate area immediately when it's configured. Flip the kvm_xen_update_runstate() and …_guest() functions and call the latter directly when the runstate area is set. This means that other ioctls which modify the runstate also write it immediately to the guest when they do so, which is also intended. Update the xen_shinfo_test to exercise the pathological case where the XEN_RUNSTATE_UPDATE flag in the top byte of the state_entry_time is actually in a different page to the rest of the 64-bit word. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
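The byte-granular handling of the _UPDATE bit described above can be shown standalone: on little-endian x86, bit 63 of state_entry_time is the top bit of its 8th byte, so it can be toggled with single-byte stores even when the 64-bit word is misaligned or split across a page boundary (a simplified illustration, not the kernel code, which also needs the appropriate memory barriers around the field updates):

    #include <stdint.h>

    #define XEN_RUNSTATE_UPDATE (1ULL << 63)

    static void set_runstate_update(uint64_t *state_entry_time, int on)
    {
        uint8_t *b = (uint8_t *)state_entry_time + 7;   /* little-endian */

        if (on)
            *b |= (uint8_t)(XEN_RUNSTATE_UPDATE >> 56);   /* sets 0x80 */
        else
            *b &= (uint8_t)~(XEN_RUNSTATE_UPDATE >> 56);
    }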
#
c3f37199 |
|
14-Nov-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Add CPL to Xen hypercall tracepoint Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
c2b8cdfa |
|
12-Nov-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0 There are almost no hypercalls which are valid from CPL > 0, and definitely none which are handled by the kernel. Fixes: 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests") Reported-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: stable@kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
4ea9439f |
|
12-Nov-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Validate port number in SCHEDOP_poll We shouldn't allow guests to poll on arbitrary port numbers off the end of the event channel table. Fixes: 1a65105a5aba ("KVM: x86/xen: handle PV spinlocks slowpath") [dwmw2: my bug though; the original version did check the validity as a side-effect of an idr_find() which I ripped out in refactoring.] Reported-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: stable@kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
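The missing validation amounts to a bounds check against the ABI's event channel table size before polling; sketched (helper name assumed):

    for (i = 0; i < sched_poll.nr_ports; i++) {
        if (ports[i] >= max_evtchn_port(vcpu->kvm)) {
            ret = -EINVAL;
            goto out;
        }
    }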
#
73536338 |
|
28-Oct-2022 |
Eiichi Tsukata <eiichi.tsukata@nutanix.com> |
KVM: x86/xen: Fix eventfd error handling in kvm_xen_eventfd_assign() Should not call eventfd_ctx_put() in case of error. Fixes: 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests") Reported-by: syzbot+6f0c896c5a9449a10ded@syzkaller.appspotmail.com Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Message-Id: <20221028092631.117438-1-eiichi.tsukata@nutanix.com> [Introduce new goto target instead. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
52491a38 |
|
13-Oct-2022 |
Michal Luczaj <mhal@rbox.co> |
KVM: Initialize gfn_to_pfn_cache locks in dedicated helper

Move the gfn_to_pfn_cache lock initialization to another helper and call the new helper during VM/vCPU creation. There are race conditions possible due to kvm_gfn_to_pfn_cache_init()'s ability to re-initialize the cache's locks. For example: a race between ioctl(KVM_XEN_HVM_EVTCHN_SEND) and kvm_gfn_to_pfn_cache_init() leads to a corrupted shinfo gpc lock.

 (thread 1)                               | (thread 2)
                                          |
 kvm_xen_set_evtchn_fast                  |
  read_lock_irqsave(&gpc->lock, ...)      |
                                          | kvm_gfn_to_pfn_cache_init
                                          |  rwlock_init(&gpc->lock)
  read_unlock_irqrestore(&gpc->lock, ...) |

Rename "cache_init" and "cache_destroy" to activate+deactivate to avoid implying that the cache really is destroyed/freed. Note, there are more races in the newly named kvm_gpc_activate() that will be addressed separately.

Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
[sean: call out that this is a bug fix]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221013211234.1318131-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
c59fb127 |
|
20-Sep-2022 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: remove KVM_REQ_UNHALT KVM_REQ_UNHALT is now unnecessary because it is replaced by the return value of kvm_vcpu_block/kvm_vcpu_halt. Remove it. No functional change intended. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Message-Id: <20220921003201.1441511-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
c0368991 |
|
08-Aug-2022 |
Coleman Dietsch <dietschc@csp.edu> |
KVM: x86/xen: Stop Xen timer before changing IRQ Stop the Xen timer (if it's running) prior to changing the IRQ vector and potentially (re)starting the timer. Changing the IRQ vector while the timer is still running can result in KVM injecting a garbage event, e.g. kvm_xen_inject_timer_irqs() could see a non-zero xen.timer_pending from a previous timer but inject the new xen.timer_virq. Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode") Cc: stable@vger.kernel.org Link: https://syzkaller.appspot.com/bug?id=8234a9dfd3aafbf092cc5a7cd9842e3ebc45fc42 Reported-by: syzbot+e54f930ed78eb0f85281@syzkaller.appspotmail.com Signed-off-by: Coleman Dietsch <dietschc@csp.edu> Reviewed-by: Sean Christopherson <seanjc@google.com> Acked-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20220808190607.323899-3-dietschc@csp.edu> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
af735db3 |
|
08-Aug-2022 |
Coleman Dietsch <dietschc@csp.edu> |
KVM: x86/xen: Initialize Xen timer only once

Add a check for existing xen timers before initializing a new one. Currently kvm_xen_init_timer() is called on every KVM_XEN_VCPU_ATTR_TYPE_TIMER, which is causing the following ODEBUG crash when vcpu->arch.xen.timer is already set.

 ODEBUG: init active (active state 0) object type: hrtimer hint: xen_timer_callbac0
 RIP: 0010:debug_print_object+0x16e/0x250 lib/debugobjects.c:502
 Call Trace:
  __debug_object_init
  debug_hrtimer_init
  debug_init
  hrtimer_init
  kvm_xen_init_timer
  kvm_xen_vcpu_set_attr
  kvm_arch_vcpu_ioctl
  kvm_vcpu_ioctl
  vfs_ioctl

Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode")
Cc: stable@vger.kernel.org
Link: https://syzkaller.appspot.com/bug?id=8234a9dfd3aafbf092cc5a7cd9842e3ebc45fc42
Reported-by: syzbot+e54f930ed78eb0f85281@syzkaller.appspotmail.com
Signed-off-by: Coleman Dietsch <dietschc@csp.edu>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220808190607.323899-2-dietschc@csp.edu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
79f772b9 |
|
14-Jun-2022 |
Sean Christopherson <seanjc@google.com> |
KVM: x86: Query vcpu->vcpu_idx directly and drop its accessor, again Read vcpu->vcpu_idx directly instead of bouncing through the one-line wrapper, kvm_vcpu_get_idx(), and drop the wrapper. The wrapper is a remnant of the original implementation and serves no purpose; remove it (again) before it gains more users. kvm_vcpu_get_idx() was removed in the not-too-distant past by commit 4eeef2424153 ("KVM: x86: Query vcpu->vcpu_idx directly and drop its accessor"), but was unintentionally re-introduced by commit a54d806688fe ("KVM: Keep memslots in tree-based structures instead of array-based ones"), likely due to a rebase goof. The wrapper then managed to gain users in KVM's Xen code. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20220614225615.3843835-1-seanjc@google.com
|
#
04c97512 |
|
06-Apr-2022 |
Like Xu <like.xu.linux@gmail.com> |
KVM: x86/xen: Remove the redundantly included header file lapic.h The header lapic.h is included more than once, remove one of them. Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20220406063715.55625-2-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
1a65105a |
|
03-Mar-2022 |
Boris Ostrovsky <boris.ostrovsky@oracle.com> |
KVM: x86/xen: handle PV spinlocks slowpath Add support for the SCHEDOP_poll hypercall. This implementation is optimized for polling for a single channel, which is what Linux does. Polling for multiple channels is not especially efficient (and has not been tested). The PV spinlocks slow path uses this hypercall, and explicitly crashes if it's not supported. [ dwmw2: Rework to use kvm_vcpu_halt(), not supported for 32-bit guests ] Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-17-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
661a20fa |
|
03-Mar-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Advertise and document KVM_XEN_HVM_CONFIG_EVTCHN_SEND At the end of the patch series adding this batch of event channel acceleration features, finally add the feature bit which advertises them and document it all. For SCHEDOP_poll we need to wake a polling vCPU when a given port is triggered, even when it's masked — and we want to implement that in the kernel, for efficiency. So we want the kernel to know that it has sole ownership of event channel delivery. Thus, we allow userspace to make the 'promise' by setting the corresponding feature bit in its KVM_XEN_HVM_CONFIG call. As we implement SCHEDOP_poll bypass later, we will do so only if that promise has been made by userspace. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-16-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
fde0451b |
|
03-Mar-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Support per-vCPU event channel upcall via local APIC Windows uses a per-vCPU vector, and it's delivered via the local APIC basically like an MSI (with associated EOI) unlike the traditional guest-wide vector which is just magically asserted by Xen (and in the KVM case by kvm_xen_has_interrupt() / kvm_cpu_get_extint()). Now that the kernel is able to raise event channel events for itself, being able to do so for Windows guests is also going to be useful. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-15-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
28d1629f |
|
03-Mar-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Kernel acceleration for XENVER_version Turns out this is a fast path for PV guests because they use it to trigger the event channel upcall. So letting it bounce all the way up to userspace is not great. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-14-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
53639526 |
|
03-Mar-2022 |
Joao Martins <joao.m.martins@oracle.com> |
KVM: x86/xen: handle PV timers oneshot mode

If the guest has offloaded the timer virq, handle the following hypercalls for programming the timer:

 VCPUOP_set_singleshot_timer
 VCPUOP_stop_singleshot_timer
 set_timer_op(timestamp_ns)

The event channel corresponding to the timer virq is then used to inject events once timer deadlines are met. For now we back the PV timer with hrtimer.

[ dwmw2: Add save/restore, 32-bit compat mode, immediate delivery, don't check timer in kvm_vcpu_has_event() ]

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-13-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
942c2490 |
|
03-Mar-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID In order to intercept hypercalls such as VCPUOP_set_singleshot_timer, we need to be aware of the Xen CPU numbering. This looks a lot like the Hyper-V handling of vpidx, for obvious reasons. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-12-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
0ec6c5c5 |
|
03-Mar-2022 |
Joao Martins <joao.m.martins@oracle.com> |
KVM: x86/xen: handle PV IPI vcpu yield Cooperative Linux guests after an IPI-many may yield vcpu if any of the IPI'd vcpus were preempted (i.e. runstate is 'runnable'.) Support SCHEDOP_yield for handling yield. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-11-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
2fd6df2f |
|
03-Mar-2022 |
Joao Martins <joao.m.martins@oracle.com> |
KVM: x86/xen: intercept EVTCHNOP_send from guests Userspace registers a sending @port to either deliver to an @eventfd or directly back to a local event channel port. After binding events the guest or host may wish to bind those events to a particular vcpu. This is usually done for unbound and interdomain events. Update requests are handled via the KVM_XEN_EVTCHN_UPDATE flag. Unregistered ports are handled by the emulator. Co-developed-by: Ankur Arora <ankur.a.arora@oracle.com> Co-developed-By: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-10-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
35025735 |
|
03-Mar-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Support direct injection of event channel events This adds a KVM_XEN_HVM_EVTCHN_SEND ioctl which allows direct injection of events given an explicit { vcpu, port, priority } in precisely the same form that those fields are given in the IRQ routing table. Userspace is currently able to inject 2-level events purely by setting the bits in the shared_info and vcpu_info, but FIFO event channels are harder to deal with; we will need the kernel to take sole ownership of delivery when we support those. A patch advertising this feature with a new bit in the KVM_CAP_XEN_HVM ioctl will be added in a subsequent patch. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-9-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
8733068b |
|
03-Mar-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Make kvm_xen_set_evtchn() reusable from other places Clean it up to return -errno on error consistently, while still being compatible with the return conventions for kvm_arch_set_irq_inatomic() and the kvm_set_irq() callback. We use -ENOTCONN to indicate when the port is masked. No existing users care, except that it's negative. Also allow it to optimise the vCPU lookup. Unless we abuse the lapic map, there is no quick lookup from APIC ID to a vCPU; the logic in kvm_get_vcpu_by_id() will just iterate over all vCPUs till it finds the one it wants. So do that just once and stash the result in the struct kvm_xen_evtchn for next time. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-8-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
69d413cf |
|
03-Mar-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_time_info This switches the final pvclock to kvm_setup_pvclock_pfncache() and now the old kvm_setup_pvclock_page() can be removed. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-7-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
7caf9571 |
|
03-Mar-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_info Currently, the fast path of kvm_xen_set_evtchn_fast() doesn't set the index bits in the target vCPU's evtchn_pending_sel, because it only has a userspace virtual address with which to do so. It just sets them in the kernel, and kvm_xen_has_interrupt() then completes the delivery to the actual vcpu_info structure when the vCPU runs. Using a gfn_to_pfn_cache allows kvm_xen_set_evtchn_fast() to do the full delivery in the common case. Clean up the fallback case too, by moving the deferred delivery out into a separate kvm_xen_inject_pending_events() function which isn't ever called in atomic contexts as __kvm_xen_has_interrupt() is. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-6-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
a795cd43 |
|
03-Mar-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Use gfn_to_pfn_cache for runstate area Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-4-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
cf1d88b3 |
|
03-Mar-2022 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: Remove dirty handling from gfn_to_pfn_cache completely It isn't OK to cache the dirty status of a page in internal structures for an indefinite period of time. Any time a vCPU exits the run loop to userspace might be its last; the VMM might do its final check of the dirty log, flush the last remaining dirty pages to the destination and complete a live migration. If we have internal 'dirty' state which doesn't get flushed until the vCPU is finally destroyed on the source after migration is complete, then we have lost data because that will escape the final copy. This problem already exists with the use of kvm_vcpu_unmap() to mark pages dirty in e.g. VMX nesting. Note that the actual Linux MM already considers the page to be dirty since we have a writeable mapping of it. This is just about the KVM dirty logging. For the nesting-style use cases (KVM_GUEST_USES_PFN) we will need to track which gfn_to_pfn_caches have been used and explicitly mark the corresponding pages dirty before returning to userspace. But we would have needed external tracking of that anyway, rather than walking the full list of GPCs to find those belonging to this vCPU which are dirty. So let's rely *solely* on that external tracking, and keep it simple rather than laying a tempting trap for callers to fall into. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-3-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
d0d96121 |
|
03-Mar-2022 |
Sean Christopherson <seanjc@google.com> |
KVM: Use enum to track if cached PFN will be used in guest and/or host Replace the guest_uses_pa and kernel_map booleans in the PFN cache code with a unified enum/bitmask. Using explicit names makes it easier to review and audit call sites. Opportunistically add a WARN to prevent passing garbage; instantiating a cache without declaring its usage is either buggy or pointless. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-2-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
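The unified flags take this shape (as the commit describes; BIT() from linux/bits.h):

    enum pfn_cache_usage {
            KVM_GUEST_USES_PFN = BIT(0),
            KVM_HOST_USES_PFN  = BIT(1),
            KVM_GUEST_AND_HOST_USE_PFN = KVM_GUEST_USES_PFN | KVM_HOST_USES_PFN,
    };

    /* At cache activation: no declared usage is either buggy or pointless. */
    WARN_ON_ONCE(!usage || (usage & KVM_GUEST_AND_HOST_USE_PFN) != usage);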
#
0264a351 |
|
27-Jan-2022 |
Sean Christopherson <seanjc@google.com> |
KVM: xen: Use static_call() for invoking kvm_x86_ops hooks Use static_call() for invoking kvm_x86_ops functions that already have a defined static call, mostly as a step toward having _all_ calls to kvm_x86_ops route through a static_call() in order to simplify auditing, e.g. via grep, that all functions have an entry in kvm-x86-ops.h, but also because there's no reason not to use a static_call(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220128005208.4008533-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
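The general static_call() pattern, for reference (a generic example, not KVM's exact KVM_X86_OP machinery): the call site is patched into a direct call, avoiding the indirect branch of an ops-table pointer.

    #include <linux/static_call.h>

    static int default_hook(int x) { return x; }
    DEFINE_STATIC_CALL(my_hook, default_hook);

    int caller(int x)
    {
            /* Compiles to a direct call; retargetable at runtime via
             * static_call_update(my_hook, some_other_fn). */
            return static_call(my_hook)(x);
    }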
#
fcb732d8 |
|
25-Oct-2021 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Fix runstate updates to be atomic when preempting vCPU There are circumstances when kvm_xen_update_runstate_guest() should not sleep because it ends up being called from __schedule() when the vCPU is preempted: [ 222.830825] kvm_xen_update_runstate_guest+0x24/0x100 [ 222.830878] kvm_arch_vcpu_put+0x14c/0x200 [ 222.830920] kvm_sched_out+0x30/0x40 [ 222.830960] __schedule+0x55c/0x9f0 To handle this, make it use the same trick as __kvm_xen_has_interrupt(), of using the hva from the gfn_to_hva_cache directly. Then it can use pagefault_disable() around the accesses and just bail out if the page is absent (which is unlikely). I almost switched to using a gfn_to_pfn_cache here and bailing out if kvm_map_gfn() fails, like kvm_steal_time_set_preempted() does — but on closer inspection it looks like kvm_map_gfn() will *always* fail in atomic context for a page in IOMEM, which means it will silently fail to make the update every single time for such guests, AFAICT. So I didn't do it that way after all. And will probably fix that one too. Cc: stable@vger.kernel.org Fixes: 30b5c851af79 ("KVM: x86/xen: Add support for vCPU runstate information") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <b17a93e5ff4561e57b1238e3e7ccd0b613eb827e.camel@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
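The non-sleeping access pattern referred to above, sketched (hva, offset and state are assumed locals; the point is that pagefault_disable() turns a would-be fault into a quick -EFAULT instead of a sleep):

    pagefault_disable();
    ret = __put_user(state, (int __user *)(hva + offset));
    pagefault_enable();
    if (ret)
            return;         /* page not present; skip this update */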
#
adb759e5 |
|
23-Jan-2022 |
Peter Zijlstra <peterz@infradead.org> |
x86,kvm/xen: Remove superfluous .fixup usage Commit 14243b387137 ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery") adds superfluous .fixup usage after the whole .fixup section was removed in commit e5eefda5aa51 ("x86: Remove .fixup section"). Fixes: 14243b387137 ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery") Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Message-Id: <20220123124219.GH20638@worktop.programming.kicks-ass.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
55749769 |
|
10-Dec-2021 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86: Fix wall clock writes in Xen shared_info not to mark page dirty When dirty ring logging is enabled, any dirty logging without an active vCPU context will cause a kernel oops. But we've already declared that the shared_info page doesn't get dirty tracking anyway, since it would be kind of insane to mark it dirty every time we deliver an event channel interrupt. Userspace is supposed to just assume it's always dirty any time a vCPU can run or event channels are routed. So stop using the generic kvm_write_wall_clock() and just write directly through the gfn_to_pfn_cache that we already have set up. We can make kvm_write_wall_clock() static in x86.c again now, but let's not remove the 'sec_hi_ofs' argument even though it's not used yet. At some point we *will* want to use that for KVM guests too. Fixes: 629b5348841a ("KVM: x86/xen: update wallclock region") Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211210163625.2886-6-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
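The write through the mapped page follows the usual pvclock seqlock-style protocol; a sketch, with 'wc' assumed to point at the wall-clock field inside the kernel mapping of shared_info and 'wall_nsec' the wall time in nanoseconds:

    u32 version = wc->version;

    wc->version = version = (version + 1) | 1;  /* odd: update in progress */
    smp_wmb();
    wc->sec  = (u32)(wall_nsec / NSEC_PER_SEC);
    wc->nsec = (u32)(wall_nsec % NSEC_PER_SEC);
    smp_wmb();
    wc->version = version + 1;                  /* even: update complete */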
#
14243b38 |
|
10-Dec-2021 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery This adds basic support for delivering 2 level event channels to a guest. Initially, it only supports delivery via the IRQ routing table, triggered by an eventfd. In order to do so, it has a kvm_xen_set_evtchn_fast() function which will use the pre-mapped shared_info page if it already exists and is still valid, while the slow path through the irqfd_inject workqueue will remap the shared_info page if necessary. It sets the bits in the shared_info page but not the vcpu_info; that is deferred to __kvm_xen_has_interrupt() which raises the vector to the appropriate vCPU. Add a 'verbose' mode to xen_shinfo_test while adding test cases for this. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211210163625.2886-5-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
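The 2-level delivery itself reduces to a couple of bit operations on the shared_info page; a sketch for the 64-bit guest layout ('si' is the mapped shared_info, 'port' the event channel number), with the vcpu_info side then completed as sketched further up:

    if (test_and_set_bit(port, (unsigned long *)si->evtchn_pending))
            return 0;       /* already pending: nothing more to do */

    if (test_bit(port, (unsigned long *)si->evtchn_mask))
            return 0;       /* masked: left pending, not delivered */

    /* Unmasked: propagate to the target vCPU's evtchn_pending_sel. */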
#
1cfc9c4b |
|
10-Dec-2021 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Maintain valid mapping of Xen shared_info page Use the newly reinstated gfn_to_pfn_cache to maintain a kernel mapping of the Xen shared_info page so that it can be accessed in atomic context. Note that we do not participate in dirty tracking for the shared info page and we do not explicitly mark it dirty every single time we deliver an event channel interrupt. We wouldn't want to do that even if we *did* have a valid vCPU context with which to do so. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211210163625.2886-4-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
b5aead00 |
|
23-May-2021 |
Tom Lendacky <thomas.lendacky@amd.com> |
KVM: x86: Assume a 64-bit hypercall for guests with protected state When processing a hypercall for a guest with protected state, currently SEV-ES guests, the guest CS segment register can't be checked to determine if the guest is in 64-bit mode. For an SEV-ES guest, it is expected that communication between the guest and the hypervisor is performed via shared memory using the GHCB. In order to use the GHCB, the guest must have been in long mode, otherwise writes by the guest to the GHCB would be encrypted and could not be comprehended by the hypervisor. Create a new helper function, is_64_bit_hypercall(), that returns true, assuming 64-bit mode, when the guest has protected state, and otherwise invokes is_64_bit_mode() to determine the mode. Update the hypercall-related routines to use is_64_bit_hypercall() instead of is_64_bit_mode(). Add a WARN_ON_ONCE() to is_64_bit_mode() to catch occurrences of calls to this helper function for a guest running with protected state. Fixes: f1c6366e3043 ("KVM: SVM: Add required changes to support intercepts under SEV-ES") Reported-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Message-Id: <e0b20c770c9d0d1403f23d83e785385104211f74.1621878537.git.thomas.lendacky@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
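The helper, as described above (guest_state_protected being the per-vCPU flag set for SEV-ES and similar guests):

    static inline bool is_64_bit_hypercall(struct kvm_vcpu *vcpu)
    {
            /* Protected guests must use the GHCB, hence long mode. */
            return vcpu->arch.guest_state_protected || is_64_bit_mode(vcpu);
    }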
#
6a834754 |
|
15-Nov-2021 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Use sizeof_field() instead of open-coding it Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211115165030.7422-4-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
4e843647 |
|
15-Nov-2021 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Fix get_attr of KVM_XEN_ATTR_TYPE_SHARED_INFO In commit 319afe68567b ("KVM: xen: do not use struct gfn_to_hva_cache") we stopped storing this in-kernel as a GPA, and started storing it as a GFN. Which means we probably should have stopped calling gpa_to_gfn() on it when userspace asks for it back. Cc: stable@vger.kernel.org Fixes: 319afe68567b ("KVM: xen: do not use struct gfn_to_hva_cache") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211115165030.7422-2-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
0985dba8 |
|
23-Oct-2021 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Fix kvm_xen_has_interrupt() sleeping in kvm_vcpu_block() In kvm_vcpu_block, the current task is set to TASK_INTERRUPTIBLE before making a final check whether the vCPU should be woken from HLT by any incoming interrupt. This is a problem for the get_user() in __kvm_xen_has_interrupt(), which really shouldn't be sleeping when the task state has already been set. I think it's actually harmless as it would just manifest itself as a spurious wakeup, but it's causing a debug warning: [ 230.963649] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000b6bcdbc9>] prepare_to_swait_exclusive+0x30/0x80 Fix the warning by turning it into an *explicit* spurious wakeup. When invoked with !task_is_running(current) (and we might as well add in_atomic() there while we're at it), just return 1 to indicate that an IRQ is pending, which will cause a wakeup and then something will call it again in a context that *can* sleep so it can fault the page back in. Cc: stable@vger.kernel.org Fixes: 40da8ccd724f ("KVM: x86/xen: Add event channel interrupt vector upcall") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <168bf8c689561da904e48e2ff5ae4713eaef9e2d.camel@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
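The bail-out amounts to a two-line guard at the top of __kvm_xen_has_interrupt(), per the description above:

    /* Cannot fault the page in here; claim an IRQ is pending so the
     * caller wakes up and re-checks from a sleepable context. */
    if (in_atomic() || !task_is_running(current))
            return 1;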
#
319afe68 |
|
03-Aug-2021 |
Paolo Bonzini <pbonzini@redhat.com> |
KVM: xen: do not use struct gfn_to_hva_cache gfn_to_hva_cache is not thread-safe, so it is usually used only within a vCPU (whose code is protected by vcpu->mutex). The Xen interface implementation has such a cache in kvm->arch, but it is not really used except to store the location of the shared info page. Replace shinfo_set and shinfo_cache with just the value that is passed via KVM_XEN_ATTR_TYPE_SHARED_INFO; the only complication is that the initialization value is not zero anymore and therefore kvm_xen_init_vm needs to be introduced. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
27b4a9c4 |
|
21-Apr-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: x86: Rename GPR accessors to make mode-aware variants the defaults Append raw to the direct variants of kvm_register_read/write(), and drop the "l" from the mode-aware variants. I.e. make the mode-aware variants the default, and make the direct variants scary sounding so as to discourage use. Accessing the full 64-bit values irrespective of mode is rarely the desired behavior. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210422022128.3464144-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
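In practice the distinction reads like this (illustrative only):

    u64 raw = kvm_register_read_raw(vcpu, VCPU_REGS_RAX); /* full 64 bits */
    u64 val = kvm_register_read(vcpu, VCPU_REGS_RAX);     /* truncated to
                                                             32 bits outside
                                                             64-bit mode */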
#
6b48fd4c |
|
21-Apr-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/xen: Drop RAX[63:32] when processing hypercall Truncate RAX to 32 bits, i.e. consume EAX, when retrieving the hypercall index for a Xen hypercall. Per Xen documentation[*], the index is EAX when the vCPU is not in 64-bit mode. [*] http://xenbits.xenproject.org/docs/sphinx-unstable/guest-guide/x86/hypercall-abi.html Fixes: 23200b7a30de ("KVM: x86/xen: intercept xen hypercalls if enabled") Cc: Joao Martins <joao.m.martins@oracle.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210422022128.3464144-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
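A sketch of the truncation (kvm_rax_read() and is_64_bit_mode() as in the surrounding KVM code):

    u64 input = kvm_rax_read(vcpu);

    /* Per the Xen ABI, the hypercall index is EAX outside 64-bit mode. */
    if (!is_64_bit_mode(vcpu))
            input = (u32)input;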
#
30b5c851 |
|
28-Feb-2021 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Add support for vCPU runstate information This is how Xen guests do steal time accounting. The hypervisor records the amount of time spent in each of running/runnable/blocked/offline states. In the Xen accounting, a vCPU is still in state RUNSTATE_running while in Xen for a hypercall or I/O trap, etc. Only if Xen explicitly schedules does the state become RUNSTATE_blocked. In KVM this means that even when the vCPU exits the kvm_run loop, the state remains RUNSTATE_running. The VMM can explicitly set the vCPU to RUNSTATE_blocked by using the KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT attribute, and can also use KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST to retrospectively add a given amount of time to the blocked state and subtract it from the running state. The state_entry_time corresponds to get_kvmclock_ns() at the time the vCPU entered the current state, and the total times of all four states should always add up to state_entry_time. Co-developed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20210301125309.874953-2-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
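The guest-visible layout being maintained here is the Xen public ABI's runstate structure (from xen/include/public/vcpu.h):

    struct vcpu_runstate_info {
            int      state;            /* RUNSTATE_running/runnable/
                                          blocked/offline */
            uint64_t state_entry_time; /* clock at the last state change */
            uint64_t time[4];          /* ns accumulated in each state; the
                                          four entries should sum to
                                          state_entry_time */
    };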
#
7d7c5f76 |
|
28-Feb-2021 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Fix return code when clearing vcpu_info and vcpu_time_info When clearing the per-vCPU shared regions, set the return value to zero to indicate success. This was causing spurious errors to be returned to userspace on soft reset. Also add a paranoid BUILD_BUG_ON() for compat structure compatibility. Fixes: 0c165b3c01fe ("KVM: x86/xen: Allow reset of Xen attributes") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20210301125309.874953-1-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
0c165b3c |
|
08-Feb-2021 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Allow reset of Xen attributes In order to support Xen SHUTDOWN_soft_reset (for guest kexec, etc.) the VMM needs to be able to tear everything down and return the Xen features to a clean slate. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20210208232326.1830370-1-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
8f014550 |
|
26-Jan-2021 |
Vitaly Kuznetsov <vkuznets@redhat.com> |
KVM: x86: hyper-v: Make Hyper-V emulation enablement conditional Hyper-V emulation is enabled in KVM unconditionally. This is bad at least from a security standpoint as it is an extra attack surface. Ideally, there should be a per-VM capability explicitly enabled by the VMM but currently it is not the case and we can't mandate one without breaking backwards compatibility. We can, however, check guest visible CPUIDs and only enable Hyper-V emulation when the "Hv#1" interface was exposed in HYPERV_CPUID_INTERFACE. Note, VMMs are free to act in any sequence they like, e.g. they can try to set MSRs first and CPUIDs later so we still need to allow the host to read/write Hyper-V specific MSRs unconditionally. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-14-vkuznets@redhat.com> [Add selftest vcpu_set_hv_cpuid API to avoid breaking xen_vmcall_test. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
448841f0 |
|
08-Feb-2021 |
Sean Christopherson <seanjc@google.com> |
KVM: x86/xen: Use hva_t for holding hypercall page address Use hva_t, a.k.a. unsigned long, for the local variable that holds the hypercall page address. On 32-bit KVM, gcc complains about using a u64 due to the implicit cast from a 64-bit value to a 32-bit pointer. arch/x86/kvm/xen.c: In function ‘kvm_xen_write_hypercall_page’: arch/x86/kvm/xen.c:300:22: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast] 300 | page = memdup_user((u8 __user *)blob_addr, PAGE_SIZE); Cc: Joao Martins <joao.m.martins@oracle.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Fixes: 23200b7a30de ("KVM: x86/xen: intercept xen hypercalls if enabled") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210208201502.1239867-1-seanjc@google.com> Acked-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
99df541d |
|
08-Feb-2021 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Remove extra unlock in kvm_xen_hvm_set_attr() This accidentally ended up locking and then immediately unlocking kvm->lock at the beginning of the function. Fix it. Fixes: a76b9641ad1c ("KVM: x86/xen: add KVM_XEN_HVM_SET_ATTR/KVM_XEN_HVM_GET_ATTR") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20210208232326.1830370-2-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
#
40da8ccd |
|
09-Dec-2020 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Add event channel interrupt vector upcall It turns out that we can't handle event channels *entirely* in userspace by delivering them as ExtINT, because KVM is a bit picky about when it accepts ExtINT interrupts from a legacy PIC. The in-kernel local APIC has to have LVT0 configured in APIC_MODE_EXTINT and unmasked, which isn't necessarily the case for Xen guests especially on secondary CPUs. To cope with this, add kvm_xen_get_interrupt() which checks the evtchn_pending_upcall field in the Xen vcpu_info, and delivers the Xen upcall vector (configured by KVM_XEN_ATTR_TYPE_UPCALL_VECTOR) if it's set regardless of LAPIC LVT0 configuration. This gives us the minimum support we need for completely userspace-based implementation of event channels. This does mean that vcpu_enter_guest() needs to check for the evtchn_pending_upcall flag being set, because it can't rely on someone having set KVM_REQ_EVENT unless we were to add some way for userspace to do so manually. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
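A sketch of the check ('vi' is the guest's vcpu_info; the vector attribute field name follows the commit text and is assumed):

    /* Deliver the upcall vector regardless of LAPIC LVT0/ExtINT setup. */
    if (READ_ONCE(vi->evtchn_upcall_pending))
            return vcpu->kvm->arch.xen.upcall_vector;  /* assumed field name */
    return 0;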
#
f2340cd9 |
|
23-Jul-2018 |
Joao Martins <joao.m.martins@oracle.com> |
KVM: x86/xen: register vcpu time info region Allow the emulated Xen guest to register secondary vcpu time information. On Xen guests this region is mapped to userspace, allowing vdso gettimeofday to work. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
#
aa096aa0 |
|
01-Feb-2019 |
Joao Martins <joao.m.martins@oracle.com> |
KVM: x86/xen: setup pvclock updates Parameterise kvm_setup_pvclock_page() a little bit so that it can be invoked for different gfn_to_hva_cache structures, and with different offsets. Then we can invoke it for the normal KVM pvclock and also for the Xen one in the vcpu_info. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
#
73e69a86 |
|
29-Jun-2018 |
Joao Martins <joao.m.martins@oracle.com> |
KVM: x86/xen: register vcpu info The vcpu info supersedes the per vcpu area of the shared info page and the guest vcpus will use this instead. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
#
3e324615 |
|
02-Feb-2021 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Add KVM_XEN_VCPU_SET_ATTR/KVM_XEN_VCPU_GET_ATTR This will be used for per-vCPU setup such as runstate and vcpu_info. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
#
629b5348 |
|
28-Jun-2018 |
Joao Martins <joao.m.martins@oracle.com> |
KVM: x86/xen: update wallclock region Wallclock on Xen is written in the shared_info page. To that purpose, export kvm_write_wall_clock() and pass on the GPA of its location to populate the shared_info wall clock data. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
#
13ffb97a |
|
15-Jun-2018 |
Joao Martins <joao.m.martins@oracle.com> |
KVM: x86/xen: register shared_info page Add KVM_XEN_ATTR_TYPE_SHARED_INFO to allow the hypervisor to know where the guest's shared info page is. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
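From the VMM side, registration goes through the attribute interface; a hedged userspace sketch using the current uapi field names ('vm_fd' and 'shinfo_gpa' are assumed; the GPA is assumed page-aligned):

    #include <err.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    struct kvm_xen_hvm_attr ha = {
            .type = KVM_XEN_ATTR_TYPE_SHARED_INFO,
            .u.shared_info.gfn = shinfo_gpa >> 12,
    };

    if (ioctl(vm_fd, KVM_XEN_HVM_SET_ATTR, &ha))
            err(1, "KVM_XEN_HVM_SET_ATTR");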
#
a3833b81 |
|
03-Dec-2020 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: latch long_mode when hypercall page is set up Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
#
a76b9641 |
|
03-Dec-2020 |
Joao Martins <joao.m.martins@oracle.com> |
KVM: x86/xen: add KVM_XEN_HVM_SET_ATTR/KVM_XEN_HVM_GET_ATTR This will be used to set up shared info pages etc. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
#
7d6bbebb |
|
02-Feb-2021 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Add kvm_xen_enabled static key The code paths for Xen support are all fairly lightweight but if we hide them behind this, they're even *more* lightweight for any system which isn't actually hosting Xen guests. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
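The pattern is the ordinary static-key one (a generic sketch; KVM's actual key is a deferred variant, but the shape is the same):

    DEFINE_STATIC_KEY_FALSE(kvm_xen_enabled);

    /* Hot path: a patched-out no-op until some VM enables Xen support. */
    if (static_branch_unlikely(&kvm_xen_enabled))
            return kvm_xen_hypercall(vcpu);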
#
78e9878c |
|
02-Feb-2021 |
David Woodhouse <dwmw@amazon.co.uk> |
KVM: x86/xen: Move KVM_XEN_HVM_CONFIG handling to xen.c This is already more complex than the simple memcpy it originally had. Move it to xen.c with the rest of the Xen support. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
#
79033beb |
|
13-Jun-2018 |
Joao Martins <joao.m.martins@oracle.com> |
KVM: x86/xen: Fix coexistence of Xen and Hyper-V hypercalls Disambiguate Xen vs. Hyper-V calls by adding 'orl $0x80000000, %eax' at the start of the Hyper-V hypercall page when Xen hypercalls are also enabled. That bit is reserved in the Hyper-V ABI, and those hypercall numbers will never be used by Xen (because it does precisely the same trick). Switch to using kvm_vcpu_write_guest() while we're at it, instead of open-coding it. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
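The disambiguation prefix is five bytes; a sketch of writing it ahead of the Hyper-V page contents (kvm_vcpu_write_guest() is the existing helper; 'gpa' is an assumed local):

    /* or $0x80000000, %eax -- a reserved bit in the Hyper-V ABI. */
    static const u8 orl_eax[] = { 0x0d, 0x00, 0x00, 0x00, 0x80 };

    ret = kvm_vcpu_write_guest(vcpu, gpa, orl_eax, sizeof(orl_eax));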
#
23200b7a |
|
13-Jun-2018 |
Joao Martins <joao.m.martins@oracle.com> |
KVM: x86/xen: intercept xen hypercalls if enabled Add a new exit reason for emulator to handle Xen hypercalls. Since this means KVM owns the ABI, dispense with the facility for the VMM to provide its own copy of the hypercall pages; just fill them in directly using VMCALL/VMMCALL as we do for the Hyper-V hypercall page. This behaviour is enabled by a new INTERCEPT_HCALL flag in the KVM_XEN_HVM_CONFIG ioctl structure, and advertised by the same flag being returned from the KVM_CAP_XEN_HVM check. Rename xen_hvm_config() to kvm_xen_write_hypercall_page() and move it to the nascent xen.c while we're at it, and add a test case. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
|
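A sketch of the in-kernel page fill (little-endian assumed; Xen's ABI places hypercall entry N at offset N*32, and the AMD variant would use VMMCALL, 0f 01 d9, instead):

    u32 i;

    for (i = 0; i < PAGE_SIZE / 32; i++) {
            u8 *insn = page + i * 32;

            insn[0] = 0xb8;                  /* mov $imm32, %eax */
            memcpy(&insn[1], &i, 4);         /* hypercall number */
            insn[5] = 0x0f;                  /* vmcall = 0f 01 c1 */
            insn[6] = 0x01;
            insn[7] = 0xc1;
            insn[8] = 0xc3;                  /* ret */
    }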