History log of /freebsd-current/sys/amd64/vmm/amd/svm.c
Revision Date Author Comments
# 683ea4d2 29-Dec-2023 Vitaliy Gusev <gusev.vitaliy@gmail.com>

vmm: MTRR should be saved/restored

This fixes restoring a Linux VM if it was suspended while in the GRUB
menu.

Adding MTTR increases the kernel dump size by 256 bytes per vCPU.

Sponsored by: vStack
Reviewed by: markj, rew
Differential Revision: https://reviews.freebsd.org/D43226


# 181afaaa 07-Dec-2023 Bojan Novković <bojan.novkovic@fer.hr>

vmm: implement VM_CAP_MASK_HWINTR on AMD CPUs

This patch implements the interrupt blocking VM capability on AMD
CPUs. Implementing this capability allows the GDB stub to single-step
a virtual machine without landing inside interrupt handlers.

Reviewed by: jhb, corvink
Sponsored by: Google, Inc. (GSoC 2022)
Differential Revision: https://reviews.freebsd.org/D42299


# e3b4fe64 07-Dec-2023 Bojan Novković <bojan.novkovic@fer.hr>

vmm: implement single-stepping for AMD CPUs

This patch implements single-stepping for AMD CPUs using the RFLAGS.TF
single-stepping mechanism. The GDB stub requests single-stepping
using the VM_CAP_RFLAGS_TF capability. Setting this capability will
set the RFLAGS.TF bit on the selected vCPU, activate DB exception
intercepts, and activate POPF/PUSH instruction intercepts. The
resulting DB exception is then caught by the IDT_DB vmexit handler and
bounced to userland where it is processed by the GDB stub. This patch
also makes sure that the value of the TF bit is correctly updated and
that it is not erroneously propagated into memory. Stepping over PUSHF
will cause the vm_handle_db function to correct the pushed RFLAGS
value and stepping over POPF will update the shadowed TF bit copy.

Reviewed by: jhb
Sponsored by: Google, Inc. (GSoC 2022)
Differential Revision: https://reviews.freebsd.org/D42296


# 231eee17 07-Dec-2023 Bojan Novković <bojan.novkovic@fer.hr>

vmm: enable software breakpoints for AMD CPUs

This patch adds support for software breakpoint vmexits on AMD SVM.
It implements the VM_CAP_BPT_EXIT used to enable software breakpoints.
When enabled, breakpoint vmexits are passed to userspace where they
are handled by the GDB stub.

Reviewed by: jhb
Sponsored by: Google, Inc. (GSoC 2022)
Differential Revision: https://reviews.freebsd.org/D42295


# 78c1d174 07-Dec-2023 Bojan Novković <bojan.novkovic@fer.hr>

vmm: refactor event reflection in AMD SVM

This patch refactors AMD SVM event reflection to allow events to be
propagated to userland, rather than always reflected into the guest.

This is necessary to implement some capabilities that request VMEXITs
when a specific exception occurs (e.g. VM_CAP_BPT_EXIT).

Reviewed by: jhb
Sponsored by: Google, Inc. (GSoC 2022)
Differential Revision: https://reviews.freebsd.org/D42405


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# b10e100d 05-May-2023 Corvin Köhne <corvink@FreeBSD.org>

vmm: don't free unallocated memory

If vmx or svm is disabled in BIOS or the device isn't supported by vmm,
modinit won't allocate these state save areas. As kmem_free panics when
passing a NULL pointer to it, loading the vmm kernel module causes a
panic too.

PR: 271251
Reviewed by: markj
Fixes: 74ac712f72cfd6d7b3db3c9d3b72ccf2824aa183 ("vmm: Dynamically allocate a couple of per-CPU state save areas")
MFC after: 1 week
Sponsored by: Beckhoff Automation GmbH & Co. KG
Differential Revision: https://reviews.freebsd.org/D39974


# 74ac712f 26-Apr-2023 Mark Johnston <markj@FreeBSD.org>

vmm: Dynamically allocate a couple of per-CPU state save areas

This avoids bloating the BSS when MAXCPU is large.

No functional change intended.

PR: 269572
Reviewed by: corvink, rew
Tested by: rew
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39805


# 8104fc31 28-Feb-2023 Vitaliy Gusev <gusev.vitaliy@gmail.com>

bhyve: fix restore of kernel structs

vmx_snapshot() and svm_snapshot() do not save any data and error occurs at
resume:

Restoring kernel structs...
vm_restore_kern_struct: Kernel struct size was 0 for: vmx
Failed to restore kernel structs.

Reviewed by: corvink, markj
Fixes: 39ec056e6dbd89e26ee21d2928dbd37335de0ebc ("vmm: Rework snapshotting of CPU-specific per-vCPU data.")
MFC after: 2 weeks
Sponsored by: vStack
Differential Revision: https://reviews.freebsd.org/D38476


# 892feec2 15-Nov-2022 Corvin Köhne <corvink@FreeBSD.org>

vmm: avoid spurious rendezvous

A vcpu only checks if a rendezvous is in progress or not to decide if it
should handle a rendezvous. This could lead to spurios rendezvous where
a vcpu tries a handle a rendezvous it isn't part of. This situation is
properly handled by vm_handle_rendezvous but it could potentially
degrade the performance. Avoid that by an early check if the vcpu is
part of the rendezvous or not.

At the moment, rendezvous are only used to spin up application
processors and to send ioapic interrupts. Spinning up application
processors is done in the guest boot phase by sending INIT SIPI
sequences to single vcpus. This is known to cause spurious rendezvous
and only occurs in the boot phase. Sending ioapic interrupts is rare
because modern guest will use msi and the rendezvous is always send to
all vcpus.

Reviewed by: jhb
MFC after: 1 week
Sponsored by: Beckhoff Automation GmbH & Co. KG
Differential Revision: https://reviews.freebsd.org/D37390


# 80cb5d84 18-Nov-2022 John Baldwin <jhb@FreeBSD.org>

vmm: Pass vcpu instead of vm and vcpuid to APIs used from CPU backends.

Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37162


# d3956e46 18-Nov-2022 John Baldwin <jhb@FreeBSD.org>

vmm: Use struct vcpu in the instruction emulation code.

This passes struct vcpu down in place of struct vm and and integer
vcpu index through the in-kernel instruction emulation code. To
minimize userland disruption, helper macros are used for the vCPU
arguments passed into and through the shared instruction emulation
code.

A few other APIs used by the instruction emulation code have also been
updated to accept struct vcpu in the kernel including
vm_get/set_register and vm_inject_fault.

Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37161


# 3dc3d32a 18-Nov-2022 John Baldwin <jhb@FreeBSD.org>

vmm: Use struct vcpu with the vmm_stat API.

The function callbacks still use struct vm and and vCPU index.

Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37157


# 950af9ff 18-Nov-2022 John Baldwin <jhb@FreeBSD.org>

vmm: Expose struct vcpu as an opaque type.

Pass a pointer to the current struct vcpu to the vcpu_init callback
and save this pointer in the CPU-specific vcpu structures.

Add routines to fetch a struct vcpu by index from a VM and to query
the VM and vcpuid from a struct vcpu.

Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37156


# fca494da 18-Nov-2022 John Baldwin <jhb@FreeBSD.org>

vmm svm: Add SVM_CTR* wrapper macros.

These macros are similar to VCPU_CTR* but accept a single svm_vcpu
pointer as the first argument instead of separate vm and vcpuid.

Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37153


# 869c8d19 18-Nov-2022 John Baldwin <jhb@FreeBSD.org>

vmm: Remove the per-vm cookie argument from vmmops taking a vcpu.

This requires storing a reference to the per-vm cookie in the
CPU-specific vCPU structure. Take advantage of this new field to
remove no-longer-needed function arguments in the CPU-specific
backends. In particular, stop passing the per-vm cookie to functions
that either don't use it or only use it for KTR traces.

Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37152


# 1aa51504 18-Nov-2022 John Baldwin <jhb@FreeBSD.org>

vmm: Refactor storage of CPU-dependent per-vCPU data.

Rather than storing static arrays of per-vCPU data in the CPU-specific
per-VM structure, adopt a more dynamic model similar to that used to
manage CPU-specific per-VM data.

That is, add new vmmops methods to init and cleanup a single vCPU.
The init method returns a pointer that is stored in 'struct vcpu' as a
cookie pointer. This cookie pointer is now passed to other vmmops
callbacks in place of the integer index. The index is now only used
in KTR traces and when calling back into the CPU-independent layer.

Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37151


# 39ec056e 18-Nov-2022 John Baldwin <jhb@FreeBSD.org>

vmm: Rework snapshotting of CPU-specific per-vCPU data.

Previously some per-vCPU state was saved in vmmops_snapshot and other
state was saved in vmmops_vcmx_snapshot. Consolidate all per-vCPU
state into the latter routine and rename the hook to the more generic
'vcpu_snapshot'. Note that the CPU-independent per-vCPU data is still
stored in a separate blob as well as the per-vCPU local APIC data.

Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37150


# 19b9dd2e 18-Nov-2022 John Baldwin <jhb@FreeBSD.org>

vmm svm: Mark all VMCB state caches dirty on vCPU restore.

Mark Johnston noticed that this was missing VMCB_CACHE_LBR. Just set
all the bits as is done in svm_run() rather than trying to clear
individual bits.

Reported by: markj
Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37259


# 215d2fd5 18-Nov-2022 John Baldwin <jhb@FreeBSD.org>

vmm svm: Refactor per-vCPU data.

- Allocate VMCBs separately to avoid excessive padding in struct
svm_vcpu.

- Allocate APIC pages dynamically directly in struct vlapic.

- Move vm_mtrr into struct svm_vcpu rather than using a separate
parallel array.

Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37148


# 35abc6c2 18-Nov-2022 John Baldwin <jhb@FreeBSD.org>

vmm: Use vm_get_maxcpus() instead of VM_MAXCPU in various places.

Mostly these are loops that iterate over all possible vCPU IDs for a
specific virtual machine.

Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37147


# 0bda8d3e 07-Sep-2022 Corvin Köhne <CorvinK@beckhoff.com>

vmm: permit some IPIs to be handled by userspace

Add VM_EXITCODE_IPI to permit returning unhandled IPIs to userland.
INIT and STARTUP IPIs are now returned to userland. Due to backward
compatibility reasons, a new capability is added for enabling
VM_EXITCODE_IPI.

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D35623
Sponsored by: Beckhoff Automation GmbH & Co. KG


# 3fc17484 09-Sep-2022 Emmanuel Vadot <manu@FreeBSD.org>

Revert "vmm: permit some IPIs to be handled by userspace"

This reverts commit a5a918b7a906eaa88e0833eac70a15989d535b02.

This cause some problem with vm using bhyveload.

Reported by: pho, kp


# a5a918b7 07-Sep-2022 Corvin Köhne <CorvinK@beckhoff.com>

vmm: permit some IPIs to be handled by userspace

Add VM_EXITCODE_IPI to permit returning unhandled IPIs to userland.
INIT and Startup IPIs are now returned to userland. Due to backward
compatibility reasons, a new capability is added for enabling
VM_EXITCODE_IPI.

MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D35623
Sponsored by: Beckhoff Automation GmbH & Co. KG


# 4eadbef9 27-Jul-2022 Corvin Köhne <CorvinK@beckhoff.com>

vmm: emulate INVD by ignoring it

On physical systems the ram isn't initialized on boot. So, coreboot uses
the cache as ram in this boot phase. When exiting cache as ram, coreboot
calls INVD for making the cache consistent.

In a virtual environment ram is always initialized and the cache is
always consistent. So, we can safely ignore this call.

Reviewed by: jhb, imp
Differential Revision: https://reviews.freebsd.org/D35620
Sponsored by: Beckhoff Automation GmbH & Co. KG


# 9aa02d51 30-Jun-2022 Mihai Burcea <mihaiburcea15@gmail.com>

vmm: Fix snapshots for AMD CPUs

This patch fixes the AMD implementation for snapshotting. It removes
unnecessary vmcb fields that should not be saved and duplicates.

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D33431


# 3ba952e1 30-May-2022 Corvin Köhne <CorvinK@beckhoff.com>

vmm: add tunable to trap WBINVD

x86 is cache coherent. However, there are special cases where cache
coherency isn't ensured (e.g. when switching the caching mode). In these
cases, WBINVD can be used. WBINVD writes all cache lines back into main
memory and invalidates the whole cache.

Due to the invalidation of the whole cache, WBINVD is a very heavy
instruction and degrades the performance on all cores. So, we should
minimize the use of WBINVD as much as possible.

In a virtual environment, the WBINVD call is mostly useless. The guest
isn't able to break cache coherency because he can't switch the physical
cache mode. When using pci passthrough WBINVD might be useful.

Nevertheless, trapping and ignoring WBINVD is an unsafe operation. For
that reason, we implement it as tunable.

Reviewed by: jhb
Sponsored by: Beckhoff Automation GmbH & Co. KG
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D35253


# f877977a 10-Apr-2022 Robert Wing <rew@FreeBSD.org>

vmm: fix set but not used warnings


# b7924341 27-Aug-2021 Andrew Turner <andrew@FreeBSD.org>

Create sys/reg.h for the common code previously in machine/reg.h

Move the common kernel function signatures from machine/reg.h to a new
sys/reg.h. This is in preperation for adding PT_GETREGSET to ptrace(2).

Reviewed by: imp, markj
Sponsored by: DARPA, AFRL (original work)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19830


# 15add60d 27-Nov-2020 Peter Grehan <grehan@FreeBSD.org>

Convert vmm_ops calls to IFUNC

There is no need for these to be function pointers since they are
never modified post-module load.

Rename AMD/Intel ops to be more consistent.

Submitted by: adam_fenn.io
Reviewed by: markj, grehan
Approved by: grehan (bhyve)
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D27375


# 6f5a9606 11-Nov-2020 Mark Johnston <markj@FreeBSD.org>

vmm: Make pmap_invalidate_ept() wait synchronously for guest exits

Currently EPT TLB invalidation is done by incrementing a generation
counter and issuing an IPI to all CPUs currently running vCPU threads.
The VMM inner loop caches the most recently observed generation on each
host CPU and invalidates TLB entries before executing the VM if the
cached generation number is not the most recent value.
pmap_invalidate_ept() issues IPIs to force each vCPU to stop executing
guest instructions and reload the generation number. However, it does
not actually wait for vCPUs to exit, potentially creating a window where
guests may continue to reference stale TLB entries.

Fix the problem by bracketing guest execution with an SMR read section
which is entered before loading the invalidation generation. Then,
pmap_invalidate_ept() increments the current write sequence before
loading pm_active and sending IPIs, and polls readers to ensure that all
vCPUs potentially operating with stale TLB entries have exited before
pmap_invalidate_ept() returns.

Also ensure that unsynchronized loads of the generation counter are
wrapped with atomic(9), and stop (inconsistently) updating the
invalidation counter and pm_active bitmask with acquire semantics.

Reviewed by: grehan, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26910


# a3f2a9c5 01-Oct-2020 John Baldwin <jhb@FreeBSD.org>

Clear the upper 32-bits of registers in x86_emulate_cpuid().

Per the Intel manuals, CPUID is supposed to unconditionally zero the
upper 32 bits of the involved (rax/rbx/rcx/rdx) registers.
Previously, the emulation would cast pointers to the 64-bit register
values down to `uint32_t`, which while properly manipulating the lower
bits, would leave any garbage in the upper bits uncleared. While no
existing guest OSes seem to stumble over this in practice, the bhyve
emulation should match x86 expectations.

This was discovered through alignment warnings emitted by gcc9, while
testing it against SmartOS/bhyve.

SmartOS bug: https://smartos.org/bugview/OS-8168
Submitted by: Patrick Mooney
Reviewed by: rgrimes
Differential Revision: https://reviews.freebsd.org/D24727


# 09860d44 15-Sep-2020 Ed Maste <emaste@FreeBSD.org>

bhyve: do not permit write access to VMCB / VMCS

Reported by: Patrick Mooney
Submitted by: jhb
Security: CVE-2020-24718


# 101d5b52 15-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

bhyve: intercept AMD SVM instructions.

Intercept and report #UD to VM on SVM/AMD in case VM tried to execute an
SVM instruction. Otherwise, SVM allows execution of them, and instructions
operate on host physical addresses despite being executed in guest mode.

Reported by: Maxime Villard <max@m00nbsd.net>
admbug: 972
CVE: CVE-2020-7467
Reviewed by: grehan, markj
Differential revision: https://reviews.freebsd.org/D26313


# 543769bf 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

amd64: clean up empty lines in .c and .h files


# 9ce875d9 23-Aug-2020 Konstantin Belousov <kib@FreeBSD.org>

amd64 pmap: LA57 AKA 5-level paging

Since LA57 was moved to the main SDM document with revision 072, it
seems that we should have a support for it, and silicons are coming.

This patch makes pmap support both LA48 and LA57 hardware. The
selection of page table level is done at startup, kernel always
receives control from loader with 4-level paging. It is not clear how
UEFI spec would adapt LA57, for instance it could hand out control in
LA57 mode sometimes.

To switch from LA48 to LA57 requires turning off long mode, requesting
LA57 in CR4, then re-entering long mode. This is somewhat delicate
and done in pmap_bootstrap_la57(). AP startup in LA57 mode is much
easier, we only need to toggle a bit in CR4 and load right value in CR3.

I decided to not change kernel map for now. Single PML5 entry is
created that points to the existing kernel_pml4 (KML4Phys) page, and a
pml5 entry to create our recursive mapping for vtopte()/vtopde().
This decision is motivated by the fact that we cannot overcommit for
KVA, so large space there is unusable until machines start providing
wider physical memory addressing. Another reason is that I do not
want to break our fragile autotuning, so the KVA expansion is not
included into this first step. Nice side effect is that minidumps are
compatible.

On the other hand, (very) large address space is definitely
immediately useful for some userspace applications.

For userspace, numbering of pte entries (or page table pages) is
always done for 5-level structures even if we operate in 4-level mode.
The pmap_is_la57() function is added to report the mode of the
specified pmap, this is done not to allow simultaneous 4-/5-levels
(which is not allowed by hw), but to accomodate for EPT which has
separate level control and in principle might not allow 5-leve EPT
despite x86 paging supports it. Anyway, it does not seems critical to
have 5-level EPT support now.

Tested by: pho (LA48 hardware)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D25273


# 483d953a 04-May-2020 John Baldwin <jhb@FreeBSD.org>

Initial support for bhyve save and restore.

Save and restore (also known as suspend and resume) permits a snapshot
to be taken of a guest's state that can later be resumed. In the
current implementation, bhyve(8) creates a UNIX domain socket that is
used by bhyvectl(8) to send a request to save a snapshot (and
optionally exit after the snapshot has been taken). A snapshot
currently consists of two files: the first holds a copy of guest RAM,
and the second file holds other guest state such as vCPU register
values and device model state.

To resume a guest, bhyve(8) must be started with a matching pair of
command line arguments to instantiate the same set of device models as
well as a pointer to the saved snapshot.

While the current implementation is useful for several uses cases, it
has a few limitations. The file format for saving the guest state is
tied to the ABI of internal bhyve structures and is not
self-describing (in that it does not communicate the set of device
models present in the system). In addition, the state saved for some
device models closely matches the internal data structures which might
prove a challenge for compatibility of snapshot files across a range
of bhyve versions. The file format also does not currently support
versioning of individual chunks of state. As a result, the current
file format is not a fixed binary format and future revisions to save
and restore will break binary compatiblity of snapshot files. The
goal is to move to a more flexible format that adds versioning,
etc. and at that point to commit to providing a reasonable level of
compatibility. As a result, the current implementation is not enabled
by default. It can be enabled via the WITH_BHYVE_SNAPSHOT=yes option
for userland builds, and the kernel option BHYVE_SHAPSHOT.

Submitted by: Mihai Tiganus, Flavius Anton, Darius Mihai
Submitted by: Elena Mihailescu, Mihai Carabas, Sergiu Weisz
Relnotes: yes
Sponsored by: University Politehnica of Bucharest
Sponsored by: Matthew Grooms (student scholarships)
Sponsored by: iXsystems
Differential Revision: https://reviews.freebsd.org/D19495


# b40598c5 15-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (4 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked). Use it in
preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Reviewed by: kib
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D23625
X-Generally looks fine: jhb


# cbd03a9d 13-Dec-2019 John Baldwin <jhb@FreeBSD.org>

Support software breakpoints in the debug server on Intel CPUs.

- Allow the userland hypervisor to intercept breakpoint exceptions
(BP#) in the guest. A new capability (VM_CAP_BPT_EXIT) is used to
enable this feature. These exceptions are reported to userland via
a new VM_EXITCODE_BPT that includes the length of the original
breakpoint instruction. If userland wishes to pass the exception
through to the guest, it must be explicitly re-injected via
vm_inject_exception().

- Export VMCS_ENTRY_INST_LENGTH as a VM_REG_GUEST_ENTRY_INST_LENGTH
pseudo-register. Injecting a BP# on Intel requires setting this to
the length of the breakpoint instruction. AMD SVM currently ignores
writes to this register (but reports success) and fails to read it.

- Rework the per-vCPU state tracked by the debug server. Rather than
a single 'stepping_vcpu' global, add a structure for each vCPU that
tracks state about that vCPU ('stepping', 'stepped', and
'hit_swbreak'). A global 'stopped_vcpu' tracks which vCPU is
currently reporting an event. Event handlers for MTRAP and
breakpoint exits loop until the associated event is reported to the
debugger.

Breakpoint events are discarded if the breakpoint is not present
when a vCPU resumes in the breakpoint handler to retry submitting
the breakpoint event.

- Maintain a linked-list of active breakpoints in response to the GDB
'Z0' and 'z0' packets.

Reviewed by: markj (earlier version)
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D20309


# e08087ee 28-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Use get_pcpu() to fetch the current CPU's pcpu pointer.

This avoids encoding knowledge about how pcpu objects are allocated and is
also a few instructions shorter.

MFC after: 2 weeks


# 13a7c4d4 07-Aug-2019 Mark Johnston <markj@FreeBSD.org>

Use designated initializers for vmm_ops.

MFC after: 3 days


# a488c9c9 25-Apr-2019 Rodney W. Grimes <rgrimes@FreeBSD.org>

Add accessor function for vm->maxcpus

Replace most VM_MAXCPU constant useses with an accessor function to
vm->maxcpus which for now is initialized and kept at the value of
VM_MAXCPUS.

This is a rework of Fabian Freyer (fabian.freyer_physik.tu-berlin.de)
work from D10070 to adjust it for the cpu topology changes that
occured in r332298

Submitted by: Fabian Freyer (fabian.freyer_physik.tu-berlin.de)
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Approved by: bde (mentor), jhb (maintainer)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D18755


# de679f6e 15-Oct-2018 John Baldwin <jhb@FreeBSD.org>

Reload the LDT selector after an AMD-v #VMEXIT.

cpu_switch() always reloads the LDT, so this can only affect the
hypervisor process itself. Fix this by explicitly reloading the host
LDT selector after each #VMEXIT. The stock bhyve process on FreeBSD
never uses a custom LDT, so this change is cosmetic.

Reviewed by: kib
Tested by: Mike Tancsa <mike@sentex.net>
Approved by: re (gjb)
MFC after: 2 weeks


# ebc3c37c 13-Jun-2018 Marcelo Araujo <araujo@FreeBSD.org>

Add SPDX tags to vmm(4).

MFC after: 4 weeks.
Sponsored by: iXsystems Inc.


# 9e2154ff 21-May-2018 John Baldwin <jhb@FreeBSD.org>

Cleanups related to debug exceptions on x86.

- Add constants for fields in DR6 and the reserved fields in DR7. Use
these constants instead of magic numbers in most places that use DR6
and DR7.
- Refer to T_TRCTRAP as "debug exception" rather than a "trace trap"
as it is not just for trace exceptions.
- Always read DR6 for debug exceptions and only clear TF in the flags
register for user exceptions where DR6.BS is set.
- Clear DR6 before returning from a debug exception handler as
recommended by the SDM dating all the way back to the 386. This
allows debuggers to determine the cause of each exception. For
kernel traps, clear DR6 in the T_TRCTRAP case and pass DR6 by value
to other parts of the handler (namely, user_dbreg_trap()). For user
traps, wait until after trapsignal to clear DR6 so that userland
debuggers can read DR6 via PT_GETDBREGS while the thread is stopped
in trapsignal().

Reviewed by: kib, rgrimes
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D15189


# fc276d92 06-Apr-2018 John Baldwin <jhb@FreeBSD.org>

Add a way to temporarily suspend and resume virtual CPUs.

This is used as part of implementing run control in bhyve's debug
server. The hypervisor now maintains a set of "debugged" CPUs.
Attempting to run a debugged CPU will fail to execute any guest
instructions and will instead report a VM_EXITCODE_DEBUG exit to
the userland hypervisor. Virtual CPUs are placed into the debugged
state via vm_suspend_cpu() (implemented via a new VM_SUSPEND_CPU ioctl).
Virtual CPUs can be resumed via vm_resume_cpu() (VM_RESUME_CPU ioctl).

The debug server suspends virtual CPUs when it wishes them to stop
executing in the guest (for example, when a debugger attaches to the
server). The debug server can choose to resume only a subset of CPUs
(for example, when single stepping) or it can choose to resume all
CPUs. The debug server must explicitly mark a CPU as resumed via
vm_resume_cpu() before the virtual CPU will successfully execute any
guest instructions.

Reviewed by: avg, grehan
Tested on: Intel (jhb), AMD (avg)
Differential Revision: https://reviews.freebsd.org/D14466


# 7b394c10 16-Feb-2018 Andriy Gapon <avg@FreeBSD.org>

move vintr_intercept_enabled under INVARIANTS

The function is not used outside of INVARIANTS since r328622.

MFC after: 1 week


# 6a8b7aa4 31-Jan-2018 Andriy Gapon <avg@FreeBSD.org>

vmm/svm: post LAPIC interrupts using event injection, not virtual interrupts

The virtual interrupt method uses V_IRQ, V_INTR_PRIO, and V_INTR_VECTOR
fields of VMCB to inject a virtual interrupt into a guest VM. This
method has many advantages over the direct event injection as it
offloads all decisions of whether and when the interrupt can be
delivered to the guest. But with a purely software emulated vAPIC the
advantage is also a problem. The problem is that the hypervisor does
not have any precise control over when the interrupt is actually
delivered to the guest (or a notification about that). Because of that
the hypervisor cannot update the interrupt vector in IRR and ISR in the
same way as real hardware would. The hypervisor becomes aware that the
interrupt is being serviced only upon the first VMEXIT after the
interrupt is delivered. This creates a window between the actual
interrupt delivery and the update of IRR and ISR. That means that IRR
and ISR might not be correctly set up to the point of the
end-of-interrupt signal.

The described deviation has been observed to cause an interrupt loss in
the following scenario. vCPU0 posts an inter-processor interrupt to
vCPU1. The interrupt is injected as a virtual interrupt by the
hypervisor. The interrupt is delivered to a guest and an interrupt
handler is invoked. The handler performs a requested action and
acknowledges the request by modifying a global variable. So far, there
is no VMEXIT and the hypervisor is unaware of the events. Then, vCPU0
notices the acknowledgment and sends another IPI with the same vector.
The IPI gets collapsed into the previous IPI in the IRR of vCPU1. Only
after that a VMEXIT of vCPU1 occurs. At that time the vector is cleared
in the IRR and is set in the ISR. vCPU1 has vAPIC state as if the
second IPI has never been sent.
The scenario is impossible on the real hardware because IRR and ISR are
updated just before the interrupt handler gets started.

I saw several possibilities of fixing the problem. One is to intercept
the virtual interrupt delivery to update IRR and ISR at the right
moment. The other is to deliver the LAPIC interrupts using the event
injection, same as legacy interrupts. I opted to use the latter
approach for several reasons. It's equivalent to what VMM/Intel does
(in !VMX case). It appears to be what VirtualBox and KVM do. The code
is already there (to support legacy interrupts).

Another possibility was to use a special intermediate state for a vector
after it is injected using a virtual interrupt and before it is known
whether it was accepted or is still pending.
That approach was implemented in https://reviews.freebsd.org/D13828
That method is more complex and does not have any clear advantage.

Please see sections 15.20 and 15.21.4 of "AMD64 Architecture
Programmer's Manual Volume 2: System Programming" (publication 24593,
revision 3.29) for comparison between event injection and virtual
interrupt injection.

PR: 215972
Reported by: ajschot@hotmail.com, grehan
Tested by: anish, grehan, Nils Beyer <nbe@renzel.net>
Reviewed by: anish, grehan
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D13780


# 65eefbe4 17-Jan-2018 John Baldwin <jhb@FreeBSD.org>

Save and restore guest debug registers.

Currently most of the debug registers are not saved and restored
during VM transitions allowing guest and host debug register values to
leak into the opposite context. One result is that hardware
watchpoints do not work reliably within a guest under VT-x.

Due to differences in SVM and VT-x, slightly different approaches are
used.

For VT-x:

- Enable debug register save/restore for VM entry/exit in the VMCS for
DR7 and MSR_DEBUGCTL.
- Explicitly save DR0-3,6 of the guest.
- Explicitly save DR0-3,6-7, MSR_DEBUGCTL, and the trap flag from
%rflags for the host. Note that because DR6 is "software" managed
and not stored in the VMCS a kernel debugger which single steps
through VM entry could corrupt the guest DR6 (since a single step
trap taken after loading the guest DR6 could alter the DR6
register). To avoid this, explicitly disable single-stepping via
the trace flag before loading the guest DR6. A determined debugger
could still defeat this by setting a breakpoint after the guest DR6
was loaded and then single-stepping.

For SVM:
- Enable debug register caching in the VMCB for DR6/DR7.
- Explicitly save DR0-3 of the guest.
- Explicitly save DR0-3,6-7, and MSR_DEBUGCTL for the host. Since SVM
saves the guest DR6 in the VMCB, the race with single-stepping
described for VT-x does not exist.

For both platforms, expose all of the guest DRx values via --get-drX
and --set-drX flags to bhyvectl.

Discussed with: avg, grehan
Tested by: avg (SVM), myself (VT-x)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D13229


# 091da2df 09-Jan-2018 Andriy Gapon <avg@FreeBSD.org>

vmm/svm: contigmalloc of the whole svm_softc is excessive

This is a followup to r307903.

struct svm_softc takes more than 200 kilobytes while what we really need
is 3 contiguous pages for I/O permission map and 2 contiguous pages for
MSR permission map. Other physically mapped structures have a size of
a single page, so a proper alignment is sufficient for their correct
mapping.

Thus, only the permission maps are allocated with contigmalloc now,
the softc is allocated with a regular malloc.

Additionally, this commit adds a check that malloc returns memory with the
expected page alignment and that contigmalloc does not fail.
Unfortunately, at present svm_vminit() is expected to always succeed and
there is no way to report an error.
So, a contigmalloc failure leads to a panic.
We should probably fix this.

MFC after: 2 weeks


# 978f3da1 26-Mar-2017 Andriy Gapon <avg@FreeBSD.org>

revert r315959 because it causes build problems

The change introduced a dependency between genassym.c and header files
generated from .m files, but that dependency is not specified in the
make files.

Also, the change could be not as useful as I thought it was.

Reported by: dchagin, Manfred Antar <null@pozo.com>, and many others


# a7b4c009 25-Mar-2017 Andriy Gapon <avg@FreeBSD.org>

specific end of interrupt implementation for AMD Local APIC

The change is more intrusive than I would like because the feature
requires that a vector number is written to a special register.
Thus, now the vector number has to be provided to lapic_eoi().
It was readily available in the IO-APIC and MSI cases, but the IPI
handlers required more work.
Also, we now store the VMM IPI number in a global variable, so that it
is available to the justreturn handler for the same reason.

Reviewed by: kib
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D9880


# 2f4c4321 28-Oct-2016 Andriy Gapon <avg@FreeBSD.org>

fix a syntax error in r308039 ...

that I somehow introduced between testing the change
iand committing it.

MFC after: 1 week
X-MFC with: r307903


# 211029ce 28-Oct-2016 Andriy Gapon <avg@FreeBSD.org>

vmm: another take at maximmum address passed to contigmalloc

Just using vm_paddr_t value with all bits set.
That should work as long as the type is unsigned.

While there, fix a couple of whitespace issues nearby.

MFC after: 1 week
X-MFC with: r307903


# 1ea77652 25-Oct-2016 Andriy Gapon <avg@FreeBSD.org>

fix up r307903, use correct max address definition

MFC after: 1 week
X-MFC with: r307903


# 3387e874 25-Oct-2016 Andriy Gapon <avg@FreeBSD.org>

vmm/svm: iopm_bitmap and msr_bitmap must be contiguous in physical memory

To achieve that the whole svm_softc is allocated with contigmalloc now.
It would be more effient to de-embed those arrays and allocate only them
with contigmalloc.

Previously, if malloc(9) used non-contiguous pages for the arrays, then
random bits in physical pages next to the first page would be used to
determine permissions for I/O port and MSR accesses. That could result
in a guest dangerously modifying the host hardware configuration.

One example is that sometimes NMI watchdog driver in a Linux guest
would be able to configure a performance counter on a host system.
The counter would generate an interrupt and if hwpmc(4) driver is loaded
on the host, then the interrupt would be delivered as an NMI.

Discussed with: jhb
Reviewed by: grehan
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D8321


# a1e1814d 22-Feb-2016 Svatopluk Kraus <skra@FreeBSD.org>

As <machine/pmap.h> is included from <vm/pmap.h>, there is no need to
include it explicitly when <vm/pmap.h> is already included.

Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D5373


# 90e528f8 22-Jun-2015 Neel Natu <neel@FreeBSD.org>

Restore the host's GS.base before returning from 'svm_launch()'.

Previously this was done by the caller of 'svm_launch()' after it returned.
This works fine as long as no code is executed in the interim that depends
on pcpu data.

The dtrace probe 'fbt:vmm:svm_launch:return' broke this assumption because
it calls 'dtrace_probe()' which in turn relies on pcpu data.

Reported by: avg
MFC after: 1 week


# 9b1aa8d6 18-Jun-2015 Neel Natu <neel@FreeBSD.org>

Restructure memory allocation in bhyve to support "devmem".

devmem is used to represent MMIO devices like the boot ROM or a VESA framebuffer
where doing a trap-and-emulate for every access is impractical. devmem is a
hybrid of system memory (sysmem) and emulated device models.

devmem is mapped in the guest address space via nested page tables similar
to sysmem. However the address range where devmem is mapped may be changed
by the guest at runtime (e.g. by reprogramming a PCI BAR). Also devmem is
usually mapped RO or RW as compared to RWX mappings for sysmem.

Each devmem segment is named (e.g. "bootrom") and this name is used to
create a device node for the devmem segment (e.g. /dev/vmm/testvm.bootrom).
The device node supports mmap(2) and this decouples the host mapping of
devmem from its mapping in the guest address space (which can change).

Reviewed by: tychon
Discussed with: grehan
Differential Revision: https://reviews.freebsd.org/D2762
MFC after: 4 weeks


# b14bd6ac 03-Jun-2015 Neel Natu <neel@FreeBSD.org>

Use tunable 'hw.vmm.svm.features' to disable specific SVM features even
though they might be available in hardware.

Use tunable 'hw.vmm.svm.num_asids' to limit the number of ASIDs used by
the hypervisor.

MFC after: 1 week


# 248e6799 28-May-2015 Neel Natu <neel@FreeBSD.org>

Fix non-deterministic delays when accessing a vcpu that was in "running" or
"sleeping" state. This is done by forcing the vcpu to transition to "idle"
by returning to userspace with an exit code of VM_EXITCODE_REQIDLE.

MFC after: 2 weeks


# ea91ca92 05-May-2015 Neel Natu <neel@FreeBSD.org>

Do a proper emulation of guest writes to MSR_EFER.
- Must-Be-Zero bits cannot be set.
- EFER_LME and EFER_LMA should respect the long mode consistency checks.
- EFER_NXE, EFER_FFXSR, EFER_TCE can be set if allowed by CPUID capabilities.
- Flag an error if guest tries to set EFER_LMSLE since bhyve doesn't enforce
segment limits in 64-bit mode.

MFC after: 2 weeks


# dbec2c5c 22-Apr-2015 Marcelo Araujo <araujo@FreeBSD.org>

Missing break in switch case.

Differential Revision: D2342
Reviewed by: neel


# 7c0b0b9a 16-Apr-2015 Neel Natu <neel@FreeBSD.org>

Prefer 'vcpu_should_yield()' over checking 'curthread->td_flags' directly.

MFC after: 1 week


# e4f605ee 24-Mar-2015 Tycho Nightingale <tychon@FreeBSD.org>

When fetching an instruction in non-64bit mode, consider the value of the
code segment base address.

Also if an instruction doesn't support a mod R/M (modRM) byte, don't
be concerned if the CPU is in real mode.

Reviewed by: neel


# 7d69783a 02-Mar-2015 Neel Natu <neel@FreeBSD.org>

Fix warnings/errors when building vmm.ko with gcc:

- fix warning about comparison of 'uint8_t v_tpr >= 0' always being true.

- fix error triggered by an empty clobber list in the inline assembly for
"clgi" and "stgi"

- fix error when compiling "vmload %rax", "vmrun %rax" and "vmsave %rax". The
gcc assembler does not like the explicit operand "%rax" while the clang
assembler requires specifying the operand "%rax". Fix this by encoding the
instructions using the ".byte" directive.

Reported by: julian
MFC after: 1 week


# e09ff171 23-Jan-2015 Neel Natu <neel@FreeBSD.org>

Add macro to identify AVIC capability (advanced virtual interrupt controller)
in AMD processors.

Submitted by: Dmitry Luhtionov (dmitryluhtionov@gmail.com)


# c9c75df4 13-Jan-2015 Neel Natu <neel@FreeBSD.org>

'struct vm_exception' was intended to be used only as the collateral for the
VM_INJECT_EXCEPTION ioctl. However it morphed into other uses like keeping
track pending exceptions for a vcpu. This in turn causes confusion because
some fields in 'struct vm_exception' like 'vcpuid' make sense only in the
ioctl context. It also makes it harder to add or remove structure fields.

Fix this by using 'struct vm_exception' only to communicate information
from userspace to vmm.ko when injecting an exception.

Also, add a field 'restart_instruction' to 'struct vm_exception'. This
field is set to '1' for exceptions where the faulting instruction is
restarted after the exception is handled.

MFC after: 1 week


# 2ce12423 06-Jan-2015 Neel Natu <neel@FreeBSD.org>

Clear blocking due to STI or MOV SS in the hypervisor when an instruction is
emulated or when the vcpu incurs an exception. This matches the CPU behavior.

Remove special case code in HLT processing that was clearing the interrupt
shadow. This is now redundant because the interrupt shadow is always cleared
when the vcpu is resumed after an instruction is emulated.

Reported by: David Reed (david.reed@tidalscale.com)
MFC after: 2 weeks


# cd86d363 30-Dec-2014 Neel Natu <neel@FreeBSD.org>

Initialize all fields of 'struct vm_exception exception' before passing it to
vm_inject_exception(). This fixes the issue that 'exception.cpuid' is
uninitialized when calling 'vm_inject_exception()'.

However, in practice this change is a no-op because vm_inject_exception()
does not use 'exception.cpuid' for anything.

Reported by: Coverity Scan
CID: 1261297
MFC after: 3 days


# 95474bc2 29-Dec-2014 Neel Natu <neel@FreeBSD.org>

Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT' on
an AMD/SVM host.

MFC after: 1 week


# b0538143 22-Dec-2014 Neel Natu <neel@FreeBSD.org>

Allow ktr(4) tracing of all guest exceptions via the tunable
"hw.vmm.trace_guest_exceptions". To enable this feature set the tunable
to "1" before loading vmm.ko.

Tracing the guest exceptions can be useful when debugging guest triple faults.

Note that there is a performance impact when exception tracing is enabled
since every exception will now trigger a VM-exit.

Also, handle machine check exceptions that happen during guest execution
by vectoring to the host's machine check handler via "int $18".

Discussed with: grehan
MFC after: 2 weeks


# f1be09bd 27-Oct-2014 Peter Grehan <grehan@FreeBSD.org>

Remove bhyve SVM feature printf's now that they are available in the
general CPU feature detection code.

Reviewed by: neel


# b1cf7bb5 16-Oct-2014 Neel Natu <neel@FreeBSD.org>

Use the correct fault type (VM_PROT_EXECUTE) for an instruction fetch.


# 8fe9436d 10-Oct-2014 Neel Natu <neel@FreeBSD.org>

Get rid of unused headers.
Restrict scope of malloc types M_SVM and M_SVM_VLAPIC by making them static.
Replace ERR() with KASSERT().
style(9) cleanup.


# 882a1f19 10-Oct-2014 Neel Natu <neel@FreeBSD.org>

Use a consistent style for messages emitted when the module is loaded.


# 30571674 26-Sep-2014 Neel Natu <neel@FreeBSD.org>

Simplify register state save and restore across a VMRUN:

- Host registers are now stored on the stack instead of a per-cpu host context.

- Host %FS and %GS selectors are not saved and restored across VMRUN.
- Restoring the %FS/%GS selectors was futile anyways since that only updates
the low 32 bits of base address in the hidden descriptor state.
- GS.base is properly updated via the MSR_GSBASE on return from svm_launch().
- FS.base is not used while inside the kernel so it can be safely ignored.

- Add function prologue/epilogue so svm_launch() can be traced with Dtrace's
FBT entry/exit probes. They also serve to save/restore the host %rbp across
VMRUN.

Reviewed by: grehan
Discussed with: Anish Gupta (akgupt3@gmail.com)


# af198d88 21-Sep-2014 Neel Natu <neel@FreeBSD.org>

Allow more VMCB fields to be cached:
- CR2
- CR0, CR3, CR4 and EFER
- GDT/IDT base/limit fields
- CS/DS/ES/SS selector/base/limit/attrib fields

The caching can be further restricted via the tunable 'hw.vmm.svm.vmcb_clean'.

Restructure the code such that the fields above are only modified in a single
place. This makes it easy to invalidate the VMCB cache when any of these fields
is modified.


# 6b844b87 16-Sep-2014 Neel Natu <neel@FreeBSD.org>

Rework vNMI injection.

Keep track of NMI blocking by enabling the IRET intercept on a successful
vNMI injection. The NMI blocking condition is cleared when the handler
executes an IRET and traps back into the hypervisor.

Don't inject NMI if the processor is in an interrupt shadow to preserve the
atomic nature of "STI;HLT". Take advantage of this and artificially set the
interrupt shadow to prevent NMI injection when restarting the "iret".

Reviewed by: Anish Gupta (akgupt3@gmail.com), grehan


# 5fb3bc71 15-Sep-2014 Neel Natu <neel@FreeBSD.org>

Minor cleanup.

Get rid of unused 'svm_feature' from the softc.

Get rid of the redundant 'vcpu_cnt' checks in svm.c. There is a similar check
in vmm.c against 'vm->active_cpus' before the AMD-specific code is called.

Submitted by: Anish Gupta (akgupt3@gmail.com)


# 79ad53fb 15-Sep-2014 Neel Natu <neel@FreeBSD.org>

Use V_IRQ, V_INTR_VECTOR and V_TPR to offload APIC interrupt delivery to the
processor. Briefly, the hypervisor sets V_INTR_VECTOR to the APIC vector
and sets V_IRQ to 1 to indicate a pending interrupt. The hardware then takes
care of injecting this vector when the guest is able to receive it.

Legacy PIC interrupts are still delivered via the event injection mechanism.
This is because the vector injected by the PIC must reflect the state of its
pins at the time the CPU is ready to accept the interrupt.

Accesses to the TPR via %CR8 are handled entirely in hardware. This requires
that the emulated TPR must be synced to V_TPR after a #VMEXIT.

The guest can also modify the TPR via the memory mapped APIC. This requires
that the V_TPR must be synced with the emulated TPR before a VMRUN.

Reviewed by: Anish Gupta (akgupt3@gmail.com)


# bbadcde4 13-Sep-2014 Neel Natu <neel@FreeBSD.org>

Set the 'vmexit->inst_length' field properly depending on the type of the
VM-exit and ultimately on whether nRIP is valid. This allows us to update
the %rip after the emulation is finished so any exceptions triggered during
the emulation will point to the right instruction.

Don't attempt to handle INS/OUTS VM-exits unless the DecodeAssist capability
is available. The effective segment field in EXITINFO1 is not valid without
this capability.

Add VM_EXITCODE_SVM to flag SVM VM-exits that cannot be handled. Provide the
VMCB fields exitinfo1 and exitinfo2 as collateral to help with debugging.

Provide a SVM VM-exit handler to dump the exitcode, exitinfo1 and exitinfo2
fields in bhyve(8).

Reviewed by: Anish Gupta (akgupt3@gmail.com)
Reviewed by: grehan


# 74accc31 13-Sep-2014 Neel Natu <neel@FreeBSD.org>

Bug fixes.

- Don't enable the HLT intercept by default. It will be enabled by bhyve(8)
if required. Prior to this change HLT exiting was always enabled making
the "-H" option to bhyve(8) meaningless.

- Recognize a VM exit triggered by a non-maskable interrupt. Prior to this
change the exit would be punted to userspace and the virtual machine would
terminate.


# fa7caa91 13-Sep-2014 Neel Natu <neel@FreeBSD.org>

style(9): insert an empty line if the function has no local variables

Pointed out by: grehan


# c2a875f9 13-Sep-2014 Neel Natu <neel@FreeBSD.org>

AMD processors that have the SVM decode assist capability will store the
instruction bytes in the VMCB on a nested page fault. This is useful because
it saves having to walk the guest page tables to fetch the instruction.

vie_init() now takes two additional parameters 'inst_bytes' and 'inst_len'
that map directly to 'vie->inst[]' and 'vie->num_valid'.

The instruction emulation handler skips calling 'vmm_fetch_instruction()'
if 'vie->num_valid' is non-zero.

The use of this capability can be turned off by setting the sysctl/tunable
'hw.vmm.svm.disable_npf_assist' to '1'.

Reviewed by: Anish Gupta (akgupt3@gmail.com)
Discussed with: grehan


# 442a04ca 11-Sep-2014 Neel Natu <neel@FreeBSD.org>

style(9): indent the switch, don't indent the case, indent case body one tab.


# e441104d 10-Sep-2014 Neel Natu <neel@FreeBSD.org>

Repurpose the V_IRQ interrupt injection to implement VMX-style interrupt
window exiting. This simply involves setting V_IRQ and enabling the VINTR
intercept. This instructs the CPU to trap back into the hypervisor as soon
as an interrupt can be injected into the guest. The pending interrupt is
then injected via the traditional event injection mechanism.

Rework vcpu interrupt injection so that Linux guests now idle with host cpu
utilization close to 0%.

Reviewed by: Anish Gupta (earlier version)
Discussed with: grehan


# 238b6cb7 09-Sep-2014 Neel Natu <neel@FreeBSD.org>

Allow intercepts and irq fields to be cached by the VMCB.

Provide APIs svm_enable_intercept()/svm_disable_intercept() to add/delete
VMCB intercepts. These APIs ensure that the VMCB state cache is invalidated
when intercepts are modified.

Each intercept is identified as a (index,bitmask) tuple. For e.g., the
VINTR intercept is identified as (VMCB_CTRL1_INTCPT,VMCB_INTCPT_VINTR).
The first 20 bytes in control area that are used to enable intercepts
are represented as 'uint32_t intercept[5]' in 'struct vmcb_ctrl'.

Modify svm_setcap() and svm_getcap() to use the new APIs.

Discussed with: Anish Gupta (akgupt3@gmail.com)


# e5397c9f 09-Sep-2014 Neel Natu <neel@FreeBSD.org>

Move the VMCB initialization into svm.c in preparation for changes to the
interrupt injection logic.

Discussed with: Anish Gupta (akgupt3@gmail.com)


# 840b1a27 09-Sep-2014 Neel Natu <neel@FreeBSD.org>

Move the event injection function into svm.c and add KTR logging for
every event injection.

This in in preparation for changes to SVM guest interrupt injection.

Discussed with: Anish Gupta (akgupt3@gmail.com)


# 2591ee3e 09-Sep-2014 Neel Natu <neel@FreeBSD.org>

Remove a bogus check that flagged an error if the guest %rip was zero.

An AP begins execution with %rip set to 0 after a startup IPI.

Discussed with: Anish Gupta (akgupt3@gmail.com)


# 5e467bd0 09-Sep-2014 Neel Natu <neel@FreeBSD.org>

Make the KTR tracepoints uniform and ensure that every VM-exit is logged.

Discussed with: Anish Gupta (akgupt3@gmail.com)


# a2901ce7 09-Sep-2014 Neel Natu <neel@FreeBSD.org>

Allow guest read access to MSR_EFER without hypervisor intervention.

Dirty the VMCB_CACHE_CR state cache when MSR_EFER is modified.


# 501f03eb 09-Sep-2014 Neel Natu <neel@FreeBSD.org>

Remove gratuitous forward declarations.
Remove tabs on empty lines.


# a2684814 06-Sep-2014 Neel Natu <neel@FreeBSD.org>

Do proper ASID management for guest vcpus.

Prior to this change an ASID was hard allocated to a guest and shared by all
its vcpus. The meant that the number of VMs that could be created was limited
to the number of ASIDs supported by the CPU. It was also inefficient because
it forced a TLB flush on every VMRUN.

With this change the number of guests that can be created is independent of
the number of available ASIDs. Also, the TLB is flushed only when a new ASID
is allocated.

Discussed with: grehan
Reviewed by: Anish Gupta (akgupt3@gmail.com)


# 38658797 04-Sep-2014 Neel Natu <neel@FreeBSD.org>

Merge svm_set_vmcb() and svm_init_vmcb() into a single function that is called
just once when a vcpu is initialized.

Discussed with: Anish Gupta (akgupt3@gmail.com)


# fea6bd5c 04-Sep-2014 Neel Natu <neel@FreeBSD.org>

Consolidate the code to restore the host TSS after a #VMEXIT into a single
function restore_host_tss().

Don't bother to restore MSR_KGSBASE after a #VMEXIT since it is not used in
the kernel. It will be restored on return to userspace.

Discussed with: Anish Gupta (akgupt3@gmail.com)


# 48e8c213 24-Aug-2014 Neel Natu <neel@FreeBSD.org>

An exception is allowed to be injected even if the vcpu is in an interrupt
shadow, so move the check for pending exception before bailing out due to
an interrupt shadow.

Change return type of 'vmcb_eventinject()' to a void and convert all error
returns into KASSERTs.

Fix VMCB_EXITINTINFO_EC(x) and VMCB_EXITINTINFO_TYPE(x) to do the shift
before masking the result.

Reviewed by: Anish Gupta (akgupt3@gmail.com)


# 4e98fc90 11-Jun-2014 Neel Natu <neel@FreeBSD.org>

Disable global interrupts early so all the software state maintained by bhyve
is sampled "atomically". Any interrupts after this point will be held pending
by the CPU until the guest starts executing and will immediately trigger a
#VMEXIT.

Reviewed by: Anish Gupta (akgupt3@gmail.com)


# 37871487 09-Jun-2014 Peter Grehan <grehan@FreeBSD.org>

Temporary fix for guest idle detection.

Handle ExtINT injection for SVM. The HPET emulation
will inject a legacy interrupt at startup, and if this
isn't handled, will result in the HLT-exit code assuming
there are outstanding ExtINTs and return without sleeping.

svm_inj_interrupts() needs more changes to bring it up
to date with the VT-x version: these are forthcoming.

Reviewed by: neel


# 1cc0e0ee 07-Jun-2014 Peter Grehan <grehan@FreeBSD.org>

Allow the TSC MSR to be accessed directly from the guest.


# 0df5b8cb 05-Jun-2014 Peter Grehan <grehan@FreeBSD.org>

ins/outs support for SVM. Modelled on the Intel VT-x code.

Remove CR2 save/restore - the guest restore/save is done
in hardware, and there is no need to save/restore the host
version (same as VT-x).

Submitted by: neel (SVM segment descriptor 'P' bit code)
Reviewed by: neel


# 8c1da7e6 03-Jun-2014 Peter Grehan <grehan@FreeBSD.org>

Use API call when VM is detected as suspended. This fixes
the (harmless) error message on exit:

vmexit_suspend: invalid reason 217645057

Reviewed by: neel, Anish Gupta (akgupt3@gmail.com)


# eee8190a 03-Jun-2014 Peter Grehan <grehan@FreeBSD.org>

Bring (almost) up-to-date with HEAD.

- use the new virtual APIC page
- update to current bhyve APIs

Tested by Anish with multiple FreeBSD SMP VMs on a Phenom,
and verified by myself with light FreeBSD VM testing
on a Sempron 3850 APU.

The issues reported with Linux guests are very likely to still
be here, but this sync eliminates the skew between the
project branch and CURRENT, and should help to determine
the causes.

Some follow-on commits will fix minor cosmetic issues.

Submitted by: Anish Gupta (akgupt3@gmail.com)


# cde843b4 04-Feb-2014 Peter Grehan <grehan@FreeBSD.org>

Changes to the SVM code to bring it up to r259205

- Convert VMM_CTR to VCPU_CTR KTR macros
- Special handling of halt, save rflags for VMM layer to emulate
halt for vcpu(sleep to be awakened by interrupt or stop it)
- Cleanup of RVI exit handling code

Submitted by: Anish Gupta (akgupt3@gmail.com)
Reviewed by: grehan


# a0b78f09 18-Dec-2013 Peter Grehan <grehan@FreeBSD.org>

Enable memory overcommit for AMD processors.

- No emulation of A/D bits is required since AMD-V RVI
supports A/D bits.
- Enable pmap PT_RVI support(w/o PAT) which is required for
memory over-commit support.
- Other minor fixes:
* Make use of VMCB EXITINTINFO field. If a #VMEXIT happens while
delivering an interrupt, EXITINTINFO has all the details that bhyve
needs to inject the same interrupt.
* SVM h/w decode assist code was incomplete - removed for now.
* Some minor code clean-up (more coming).

Submitted by: Anish Gupta (akgupt3@gmail.com)


# ab76fd58 21-Oct-2013 Neel Natu <neel@FreeBSD.org>

The ASID allocation in SVM is incorrect because it allocates a single ASID for
all vcpus belonging to a guest. This means that when different vcpus belonging
to the same guest are executing on the same host cpu there may be "leakage"
in the mappings created by one vcpu to another.

The proper fix for this is being worked on and will be committed shortly.

In the meantime workaround this bug by flushing the guest TLB entries on every
VM entry.

Submitted by: Anish Gupta (akgupt3@gmail.com)


# 4599af43 15-Oct-2013 Peter Grehan <grehan@FreeBSD.org>

Fix SVM handling of ASTPENDING, which manifested as a
hang on console output (due to a missing interrupt).

SVM does exit processing and then handles ASTPENDING which
overwrites the already handled SVM exit cause and corrupts
virtual machine state. For example, if the SVM exit was due to
an I/O port access but the main loop detected an ASTPENDING,
the exit would be processed as ASTPENDING and leave the
device (e.g. emulated UART) for that I/O port in bad state.

Submitted by: Anish Gupta (akgupt3@gmail.com)
Reviewed by: grehan


# df5e6de3 22-Aug-2013 Peter Grehan <grehan@FreeBSD.org>

Add in last remaining files to get AMD-SVM operational.

Submitted by: Anish Gupta (akgupt3@gmail.com)