Cross Reference: /freebsd-current/sys/amd64/include/vmm.h

History log of /freebsd-current/sys/amd64/include/vmm.h
Revision	Date	Author	Comments
# 1eedb4e5	11-Apr-2024	Elyes Haouas <ehaouas@noos.fr>	vmm: Fix typo Signed-off-by: Elyes Haouas <ehaouas@noos.fr> Reviewed by: imp Pull Request: https://github.com/freebsd/freebsd-src/pull/885
# f493ea65	07-Feb-2024	Mark Johnston <markj@FreeBSD.org>	vmm: Expose more registers to VM_GET_REGISTER In a follow-up revision the gdb stub will support sending an XML target description to gdb, which lets us send additional registers, including the ones added in this patch. Reviewed by: jhb MFC after: 1 month Sponsored by: Innovate UK Differential Revision: https://reviews.freebsd.org/D43665
# 7c8f1631	20-Dec-2023	Konstantin Belousov <kib@FreeBSD.org>	vmm.h: remove dup declaration Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D43139
# e3b4fe64	07-Dec-2023	Bojan Novković <bojan.novkovic@fer.hr>	vmm: implement single-stepping for AMD CPUs This patch implements single-stepping for AMD CPUs using the RFLAGS.TF single-stepping mechanism. The GDB stub requests single-stepping using the VM_CAP_RFLAGS_TF capability. Setting this capability will set the RFLAGS.TF bit on the selected vCPU, activate DB exception intercepts, and activate POPF/PUSH instruction intercepts. The resulting DB exception is then caught by the IDT_DB vmexit handler and bounced to userland where it is processed by the GDB stub. This patch also makes sure that the value of the TF bit is correctly updated and that it is not erroneously propagated into memory. Stepping over PUSHF will cause the vm_handle_db function to correct the pushed RFLAGS value and stepping over POPF will update the shadowed TF bit copy. Reviewed by: jhb Sponsored by: Google, Inc. (GSoC 2022) Differential Revision: https://reviews.freebsd.org/D42296
# 95ee2897	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: two-line .h pattern Remove /^\s*\*\n \*\s+\$FreeBSD\$$\n/
# e17eca32	23-May-2023	Mark Johnston <markj@FreeBSD.org>	vmm: Avoid embedding cpuset_t ioctl ABIs Commit 0bda8d3e9f7a ("vmm: permit some IPIs to be handled by userspace") embedded cpuset_t into the vmm(4) ioctl ABI. This was a mistake since we otherwise have some leeway to change the cpuset_t for the whole system, but we want to keep the vmm ioctl ABI stable. Rework IPI reporting to avoid this problem. Along the way, make VM_RUN a bit more efficient: - Split vmexit metadata out of the main VM_RUN structure. This data is only written by the kernel. - Have userspace pass a cpuset_t pointer and cpusetsize in the VM_RUN structure, as is done for cpuset syscalls. - Have the destination CPU mask for VM_EXITCODE_IPIs live outside the vmexit info structure, and make VM_RUN copy it out separately. Zero out any extra bytes in the CPU mask, like cpuset syscalls do. - Modify the vmexit handler prototype to take a full VM_RUN structure. PR: 271330 Reviewed by: corvink, jhb (previous versions) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D40113
# 4d846d26	10-May-2023	Warner Losh <imp@FreeBSD.org>	spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch up to that fact and revert to their recommended match of BSD-2-Clause. Discussed with: pfg MFC After: 3 days Sponsored by: Netflix
# fefac543	09-May-2023	Bojan Novković <bojan.novkovic@fer.hr>	bhyve: fix vCPU single-stepping on VMX This patch fixes virtual machine single stepping on VMX hosts. Currently, when using bhyve's gdb stub, each attempt at single-stepping a vCPU lands in a timer interrupt. The current single-stepping mechanism uses the Monitor Trap Flag feature to cause VMEXIT after a single instruction is executed. Unfortunately, the SDM states that MTF causes VMEXITs for the next instruction that gets executed, which is often not what the person using the debugger expects. [1] This patch adds a new VM capability that masks interrupts on a vCPU by blocking interrupt injection and modifies the gdb stub to use the newly added capability while single-stepping a vCPU. [1] Intel SDM 26.5.2 Vol. 3C Reviewed by: corvink, jbh MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D39949
# 7d9ef309	24-Mar-2023	John Baldwin <jhb@FreeBSD.org>	libvmmapi: Add a struct vcpu and use it in most APIs. This replaces the 'struct vm, int vcpuid' tuple passed to most API calls and is similar to the changes recently made in vmm(4) in the kernel. struct vcpu is an opaque type managed by libvmmapi. For now it stores a pointer to the VM context and an integer id. As an immediate effect this removes the divergence between the kernel and userland for the instruction emulation code introduced by the recent vmm(4) changes. Since this is a major change to the vmmapi API, bump VMMAPI_VERSION to 0x200 (2.0) and the shared library major version. While here (and since the major version is bumped), remove unused vcpu argument from vm_setup_pptdev_msi*(). Add new functions vm_suspend_all_cpus() and vm_resume_all_cpus() for use by the debug server. The underlying ioctl (which uses a vcpuid of -1) remains unchanged, but the userlevel API now uses separate functions for global CPU suspend/resume. Reviewed by: corvink, markj Differential Revision: https://reviews.freebsd.org/D38124
# 8104fc31	28-Feb-2023	Vitaliy Gusev <gusev.vitaliy@gmail.com>	bhyve: fix restore of kernel structs vmx_snapshot() and svm_snapshot() do not save any data, so an error occurs at resume: Restoring kernel structs... vm_restore_kern_struct: Kernel struct size was 0 for: vmx Failed to restore kernel structs. Reviewed by: corvink, markj Fixes: 39ec056e6dbd89e26ee21d2928dbd37335de0ebc ("vmm: Rework snapshotting of CPU-specific per-vCPU data.") MFC after: 2 weeks Sponsored by: vStack Differential Revision: https://reviews.freebsd.org/D38476
# 892feec2	15-Nov-2022	Corvin Köhne <corvink@FreeBSD.org>	vmm: avoid spurious rendezvous A vcpu only checks whether a rendezvous is in progress to decide if it should handle a rendezvous. This could lead to spurious rendezvous where a vcpu tries to handle a rendezvous it isn't part of. This situation is properly handled by vm_handle_rendezvous but it could potentially degrade performance. Avoid that with an early check of whether the vcpu is part of the rendezvous. At the moment, rendezvous are only used to spin up application processors and to send ioapic interrupts. Spinning up application processors is done in the guest boot phase by sending INIT SIPI sequences to single vcpus. This is known to cause spurious rendezvous and only occurs in the boot phase. Sending ioapic interrupts is rare because modern guests will use MSI, and the rendezvous is always sent to all vcpus. Reviewed by: jhb MFC after: 1 week Sponsored by: Beckhoff Automation GmbH & Co. KG Differential Revision: https://reviews.freebsd.org/D37390
# 1f6db5d6	30-Nov-2022	John Baldwin <jhb@FreeBSD.org>	vmm: Remove stale comment for vm_rendezvous. Support for rendezvous outside of a vcpu context (vcpuid of -1) was removed in commit 949f0f47a4e7, and the vm, vcpuid argument pair was replaced by a single struct vcpu pointer in commit d8be3d523dd5. Reported by: andrew
# ca6b48f0	18-Nov-2022	Mark Johnston <markj@FreeBSD.org>	vmm: Restore the correct vm_inject_*() prototypes Fixes: 80cb5d845b8f ("vmm: Pass vcpu instead of vm and vcpuid...") Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D37443
# ee98f99d	18-Nov-2022	John Baldwin <jhb@FreeBSD.org>	vmm: Convert VM_MAXCPU into a loader tunable hw.vmm.maxcpu. The default is now the number of physical CPUs in the system rather than 16. Reviewed by: corvink, markj Differential Revision: https://reviews.freebsd.org/D37175
# 98568a00	18-Nov-2022	John Baldwin <jhb@FreeBSD.org>	vmm: Allocate vCPUs on first use of a vCPU. Convert the vcpu[] array in struct vm to an array of pointers and allocate vCPUs on first use. This avoids always allocating VM_MAXCPU vCPUs for each VM, but instead only allocates the vCPUs in use. A new per-VM sx lock is added to serialize attempts to allocate vCPUs on first use. However, a given vCPU is never freed while the VM is active, so the pointer is read via an unlocked read first to avoid the need for the lock in the common case once the vCPU has been created. Some ioctls need to lock all vCPUs. To prevent races with ioctls that want to allocate a new vCPU, these ioctls also lock the sx lock that protects vCPU creation. Reviewed by: corvink, markj Differential Revision: https://reviews.freebsd.org/D37174
# c0f35dbf	18-Nov-2022	John Baldwin <jhb@FreeBSD.org>	vmm: Use a cpuset_t for vCPUs waiting for STARTUP IPIs. Retire the boot_state member of struct vlapic and instead use a cpuset in the VM to track vCPUs waiting for STARTUP IPIs. INIT IPIs add vCPUs to this set, and STARTUP IPIs remove vCPUs from the set. STARTUP IPIs are only reported to userland for vCPUs that were removed from the set. In particular, this permits a subsequent change to allocate vCPUs on demand when the vCPU may not be allocated until after a STARTUP IPI is reported to userland. Reviewed by: corvink, markj Differential Revision: https://reviews.freebsd.org/D37173
# 67b69e76	18-Nov-2022	John Baldwin <jhb@FreeBSD.org>	vmm: Use an sx lock to protect the memory map. Previously bhyve obtained a "read lock" on the memory map for ioctls needing to read the map by locking the last vCPU. This is now replaced by a new per-VM sx lock. Modifying the map requires exclusively locking the sx lock as well as locking all existing vCPUs. Reading the map requires either locking one vCPU or the sx lock. This permits safely modifying or querying the memory map while some vCPUs do not exist which will be true in a future commit. Reviewed by: corvink, markj Differential Revision: https://reviews.freebsd.org/D37172
# 3f0f4b15	18-Nov-2022	John Baldwin <jhb@FreeBSD.org>	vmm: Lookup vcpu pointers in vmmdev_ioctl. Centralize mapping vCPU IDs to struct vcpu objects in vmmdev_ioctl and pass vcpu pointers to the routines in vmm.c. For operations that want to perform an action on all vCPUs or on a single vCPU, pass pointers to both the VM and the vCPU using a NULL vCPU pointer to request global actions. Reviewed by: corvink, markj Differential Revision: https://reviews.freebsd.org/D37168
# d8be3d52	18-Nov-2022	John Baldwin <jhb@FreeBSD.org>	vmm: Use struct vcpu in the rendezvous code. Reviewed by: corvink, markj Differential Revision: https://reviews.freebsd.org/D37165
# 80cb5d84	18-Nov-2022	John Baldwin <jhb@FreeBSD.org>	vmm: Pass vcpu instead of vm and vcpuid to APIs used from CPU backends. Reviewed by: corvink, markj Differential Revision: https://reviews.freebsd.org/D37162
# d3956e46	18-Nov-2022	John Baldwin <jhb@FreeBSD.org>	vmm: Use struct vcpu in the instruction emulation code. This passes struct vcpu down in place of struct vm and an integer vcpu index through the in-kernel instruction emulation code. To minimize userland disruption, helper macros are used for the vCPU arguments passed into and through the shared instruction emulation code. A few other APIs used by the instruction emulation code have also been updated to accept struct vcpu in the kernel including vm_get/set_register and vm_inject_fault. Reviewed by: corvink, markj Differential Revision: https://reviews.freebsd.org/D37161
# 28b561ad	18-Nov-2022	John Baldwin <jhb@FreeBSD.org>	vmm: Add vm_gpa_hold_global wrapper function. This handles the case that guest pages are being held not on behalf of a virtual CPU but globally. Previously this was handled by passing a vcpuid of -1 to vm_gpa_hold, but that will not work in the future when vm_gpa_hold is changed to accept a struct vcpu pointer. Reviewed by: corvink, markj Differential Revision: https://reviews.freebsd.org/D37160
# 2b4fe856	18-Nov-2022	John Baldwin <jhb@FreeBSD.org>	bhyve: Remove unused vm and vcpu arguments from vm_copy routines. The arguments identifying the VM and vCPU are only needed for vm_copy_setup. Reviewed by: corvink, markj Differential Revision: https://reviews.freebsd.org/D37158
# 3dc3d32a	18-Nov-2022	John Baldwin <jhb@FreeBSD.org>	vmm: Use struct vcpu with the vmm_stat API. The function callbacks still use struct vm and a vCPU index. Reviewed by: corvink, markj Differential Revision: https://reviews.freebsd.org/D37157
# 950af9ff	18-Nov-2022	John Baldwin <jhb@FreeBSD.org>	vmm: Expose struct vcpu as an opaque type. Pass a pointer to the current struct vcpu to the vcpu_init callback and save this pointer in the CPU-specific vcpu structures. Add routines to fetch a struct vcpu by index from a VM and to query the VM and vcpuid from a struct vcpu. Reviewed by: corvink, markj Differential Revision: https://reviews.freebsd.org/D37156
# 869c8d19	18-Nov-2022	John Baldwin <jhb@FreeBSD.org>	vmm: Remove the per-vm cookie argument from vmmops taking a vcpu. This requires storing a reference to the per-vm cookie in the CPU-specific vCPU structure. Take advantage of this new field to remove no-longer-needed function arguments in the CPU-specific backends. In particular, stop passing the per-vm cookie to functions that either don't use it or only use it for KTR traces. Reviewed by: corvink, markj Differential Revision: https://reviews.freebsd.org/D37152
# 1aa51504	18-Nov-2022	John Baldwin <jhb@FreeBSD.org>	vmm: Refactor storage of CPU-dependent per-vCPU data. Rather than storing static arrays of per-vCPU data in the CPU-specific per-VM structure, adopt a more dynamic model similar to that used to manage CPU-specific per-VM data. That is, add new vmmops methods to init and cleanup a single vCPU. The init method returns a pointer that is stored in 'struct vcpu' as a cookie pointer. This cookie pointer is now passed to other vmmops callbacks in place of the integer index. The index is now only used in KTR traces and when calling back into the CPU-independent layer. Reviewed by: corvink, markj Differential Revision: https://reviews.freebsd.org/D37151
# 39ec056e	18-Nov-2022	John Baldwin <jhb@FreeBSD.org>	vmm: Rework snapshotting of CPU-specific per-vCPU data. Previously some per-vCPU state was saved in vmmops_snapshot and other state was saved in vmmops_vcmx_snapshot. Consolidate all per-vCPU state into the latter routine and rename the hook to the more generic 'vcpu_snapshot'. Note that the CPU-independent per-vCPU data is still stored in a separate blob as well as the per-vCPU local APIC data. Reviewed by: corvink, markj Differential Revision: https://reviews.freebsd.org/D37150
# 0bda8d3e	07-Sep-2022	Corvin Köhne <CorvinK@beckhoff.com>	vmm: permit some IPIs to be handled by userspace Add VM_EXITCODE_IPI to permit returning unhandled IPIs to userland. INIT and STARTUP IPIs are now returned to userland. Due to backward compatibility reasons, a new capability is added for enabling VM_EXITCODE_IPI. Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D35623 Sponsored by: Beckhoff Automation GmbH & Co. KG
# 3fc17484	09-Sep-2022	Emmanuel Vadot <manu@FreeBSD.org>	Revert "vmm: permit some IPIs to be handled by userspace" This reverts commit a5a918b7a906eaa88e0833eac70a15989d535b02. This causes problems with VMs using bhyveload. Reported by: pho, kp
# a5a918b7	07-Sep-2022	Corvin Köhne <CorvinK@beckhoff.com>	vmm: permit some IPIs to be handled by userspace Add VM_EXITCODE_IPI to permit returning unhandled IPIs to userland. INIT and Startup IPIs are now returned to userland. Due to backward compatibility reasons, a new capability is added for enabling VM_EXITCODE_IPI. MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D35623 Sponsored by: Beckhoff Automation GmbH & Co. KG
# c6d31b83	18-Jul-2022	Konstantin Belousov <kib@FreeBSD.org>	AST: rework Make most AST handlers dynamically registered. This allows subsystem-specific handler source to be located in the subsystem files, instead of making subr_trap.c aware of it. For instance, signal delivery code on return to userspace is now moved to kern_sig.c. Also, it allows some handlers to be designated as the cleanup (kclear) type, which are called both at AST and on thread/process exit. For instance, ast(), exit1(), and the NFS server no longer need to be aware of UFS softdep processing. The dynamic registration also allows third-party modules to register AST handlers if needed. There is one caveat with loadable modules: the code does not make any effort to ensure that the module is not unloaded before all threads have been processed through the AST handler in it. In fact, this is already the present behavior for hwpmc.ko and ufs.ko. I do not think it is worth the effort and the runtime overhead to try to fix it. Reviewed by: markj Tested by: emaste (arm64), pho Discussed with: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35888
# 3ba952e1	30-May-2022	Corvin Köhne <CorvinK@beckhoff.com>	vmm: add tunable to trap WBINVD x86 is cache coherent. However, there are special cases where cache coherency isn't ensured (e.g. when switching the caching mode). In these cases, WBINVD can be used. WBINVD writes all cache lines back into main memory and invalidates the whole cache. Due to the invalidation of the whole cache, WBINVD is a very heavy instruction and degrades the performance on all cores. So, we should minimize the use of WBINVD as much as possible. In a virtual environment, the WBINVD call is mostly useless. The guest isn't able to break cache coherency because it can't switch the physical cache mode. When using PCI passthrough, WBINVD might be useful. Nevertheless, trapping and ignoring WBINVD is an unsafe operation. For that reason, we implement it as a tunable. Reviewed by: jhb Sponsored by: Beckhoff Automation GmbH & Co. KG MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D35253
# f1d450dd	09-Mar-2022	John Baldwin <jhb@FreeBSD.org>	bhyve: Remove VM_MAXCPU from the userspace API/ABI. Reviewed by: grehan Differential Revision: https://reviews.freebsd.org/D34494
# f8a6ec2d	18-Mar-2021	D Scott Phillips <scottph@FreeBSD.org>	bhyve: support relocating fbuf and passthru data BARs We want to allow the UEFI firmware to enumerate and assign addresses to PCI devices so we can boot from NVMe[1]. Address assignment of PCI BARs is properly handled by the PCI emulation code in general, but a few specific cases need additional support. fbuf and passthru map additional objects into the guest physical address space and so need to handle address updates. Here we add a callback to emulated PCI devices to inform them of a BAR configuration change. fbuf and passthru then watch for these BAR changes and relocate the frame buffer memory segment and passthru device mmio area respectively. We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls to vmm(4) to facilitate the unmapping needed for address updates. [1]: https://github.com/freebsd/uefi-edk2/pull/9/ Originally by: scottph MFC After: 1 week Sponsored by: Intel Corporation Reviewed by: grehan Approved by: philip (mentor) Differential Revision: https://reviews.freebsd.org/D24066
# 15add60d	27-Nov-2020	Peter Grehan <grehan@FreeBSD.org>	Convert vmm_ops calls to IFUNC There is no need for these to be function pointers since they are never modified post-module load. Rename AMD/Intel ops to be more consistent. Submitted by: adam_fenn.io Reviewed by: markj, grehan Approved by: grehan (bhyve) MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D27375
# 543769bf	01-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	amd64: clean up empty lines in .c and .h files
# f3eb12e4	23-Aug-2020	Konstantin Belousov <kib@FreeBSD.org>	Add bhyve support for LA57 guest mode. Noted and reviewed by: grehan Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D25273
# f5f5f1e7	18-Aug-2020	Peter Grehan <grehan@FreeBSD.org>	Support guest rdtscp and rdpid instructions on Intel VT-x Enable any of rdtscp and/or rdpid for bhyve guests on Intel-based hosts that support the "enable RDTSCP" VM-execution control. Submitted by: adam_fenn.io Reported by: chuck Reviewed by: chuck, grehan, jhb Approved by: jhb (bhyve), grehan MFC after: 3 weeks Relnotes: Yes Differential Revision: https://reviews.freebsd.org/D26003
# 4daa95f8	24-Jun-2020	Conrad Meyer <cem@FreeBSD.org>	bhyve(8): For prototyping, reattempt decode in userspace If userspace has a newer bhyve than the kernel, it may be able to decode and emulate some instructions vmm.ko is unaware of. In this scenario, reset decoder state and try again. Reviewed by: grehan Differential Revision: https://reviews.freebsd.org/D24464
# 483d953a	04-May-2020	John Baldwin <jhb@FreeBSD.org>	Initial support for bhyve save and restore. Save and restore (also known as suspend and resume) permits a snapshot to be taken of a guest's state that can later be resumed. In the current implementation, bhyve(8) creates a UNIX domain socket that is used by bhyvectl(8) to send a request to save a snapshot (and optionally exit after the snapshot has been taken). A snapshot currently consists of two files: the first holds a copy of guest RAM, and the second file holds other guest state such as vCPU register values and device model state. To resume a guest, bhyve(8) must be started with a matching pair of command line arguments to instantiate the same set of device models as well as a pointer to the saved snapshot. While the current implementation is useful for several use cases, it has a few limitations. The file format for saving the guest state is tied to the ABI of internal bhyve structures and is not self-describing (in that it does not communicate the set of device models present in the system). In addition, the state saved for some device models closely matches the internal data structures which might prove a challenge for compatibility of snapshot files across a range of bhyve versions. The file format also does not currently support versioning of individual chunks of state. As a result, the current file format is not a fixed binary format and future revisions to save and restore will break binary compatibility of snapshot files. The goal is to move to a more flexible format that adds versioning, etc. and at that point to commit to providing a reasonable level of compatibility. As a result, the current implementation is not enabled by default. It can be enabled via the WITH_BHYVE_SNAPSHOT=yes option for userland builds, and the kernel option BHYVE_SNAPSHOT. Submitted by: Mihai Tiganus, Flavius Anton, Darius Mihai Submitted by: Elena Mihailescu, Mihai Carabas, Sergiu Weisz Relnotes: yes Sponsored by: University Politehnica of Bucharest Sponsored by: Matthew Grooms (student scholarships) Sponsored by: iXsystems Differential Revision: https://reviews.freebsd.org/D19495
# cfdea69d	21-Apr-2020	Conrad Meyer <cem@FreeBSD.org>	vmm(4): Decode 3-byte VEX-prefixed instructions Reviewed by: grehan Differential Revision: https://reviews.freebsd.org/D24462
# 497cb925	17-Apr-2020	Conrad Meyer <cem@FreeBSD.org>	vmm.h: Add ABI assertions and mark implicit holes The static assertions were added (with size and offsets from gdb) and verified with a build prior to marking the holes explicitly. This is in preparation for a subsequent revision, pending in phabricator, that makes use of some of these unused bits without impacting the ABI. Reviewed by: grehan Differential Revision: https://reviews.freebsd.org/D24461
# b837dadd	02-Jan-2020	Konstantin Belousov <kib@FreeBSD.org>	bhyve: terminate waiting loops if thread suspension is requested. PR: 242724 Reviewed by: markj Reported and tested by: Aleksandr Fedorov <aleksandr.fedorov@itglobal.com> (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D22881
# cbd03a9d	13-Dec-2019	John Baldwin <jhb@FreeBSD.org>	Support software breakpoints in the debug server on Intel CPUs. - Allow the userland hypervisor to intercept breakpoint exceptions (BP#) in the guest. A new capability (VM_CAP_BPT_EXIT) is used to enable this feature. These exceptions are reported to userland via a new VM_EXITCODE_BPT that includes the length of the original breakpoint instruction. If userland wishes to pass the exception through to the guest, it must be explicitly re-injected via vm_inject_exception(). - Export VMCS_ENTRY_INST_LENGTH as a VM_REG_GUEST_ENTRY_INST_LENGTH pseudo-register. Injecting a BP# on Intel requires setting this to the length of the breakpoint instruction. AMD SVM currently ignores writes to this register (but reports success) and fails to read it. - Rework the per-vCPU state tracked by the debug server. Rather than a single 'stepping_vcpu' global, add a structure for each vCPU that tracks state about that vCPU ('stepping', 'stepped', and 'hit_swbreak'). A global 'stopped_vcpu' tracks which vCPU is currently reporting an event. Event handlers for MTRAP and breakpoint exits loop until the associated event is reported to the debugger. Breakpoint events are discarded if the breakpoint is not present when a vCPU resumes in the breakpoint handler to retry submitting the breakpoint event. - Maintain a linked-list of active breakpoints in response to the GDB 'Z0' and 'z0' packets. Reviewed by: markj (earlier version) MFC after: 2 months Differential Revision: https://reviews.freebsd.org/D20309
# 490d56c5	31-Jul-2019	Ed Maste <emaste@FreeBSD.org>	vmx: use C99 bool, not boolean_t Bhyve's vmm is a self-contained modern component and thus a good candidate for use of C99 types. Reviewed by: jhb, kib, markj, Patrick Mooney MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21036
# 422a8a4d	12-Jul-2019	Scott Long <scottl@FreeBSD.org>	Tie the name limit of a VM to SPECNAMELEN from devfs instead of a hard-coded value. Don't allocate space for it from the kernel stack. Account for prefix, suffix, and separator space in the name. This takes the effective length up to 229 bytes on 13-current, and 37 bytes on 12-stable. 37 bytes is enough to hold a full GUID string. PR: 234134 MFC after: 1 week Differential Revision: http://reviews.freebsd.org/D20924
# a488c9c9	25-Apr-2019	Rodney W. Grimes <rgrimes@FreeBSD.org>	Add accessor function for vm->maxcpus Replace most VM_MAXCPU constant uses with an accessor function to vm->maxcpus which for now is initialized and kept at the value of VM_MAXCPUS. This is a rework of Fabian Freyer's (fabian.freyer_physik.tu-berlin.de) work from D10070 to adjust it for the cpu topology changes that occurred in r332298 Submitted by: Fabian Freyer (fabian.freyer_physik.tu-berlin.de) Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Approved by: bde (mentor), jhb (maintainer) MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D18755
# 27d26457	27-Sep-2018	Andrew Turner <andrew@FreeBSD.org>	Handle a guest executing a vm instruction by trapping and raising an undefined instruction exception. Previously we would exit the guest, however an unprivileged user could execute these. Found with: syzkaller Reviewed by: araujo, tychon (previous version) Approved by: re (kib) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D17192
# 147d12a7	15-May-2018	Antoine Brodin <antoine@FreeBSD.org>	vmmdev: return EFAULT when trying to read beyond VM system memory max address Currently, when using dd(1) to take a VM memory image, the capture never ends, reading zeroes when it's beyond VM system memory max address. Return EFAULT when trying to read beyond VM system memory max address. Reviewed by: imp, grehan, anish Approved by: grehan Differential Revision: https://reviews.freebsd.org/D15156
# 6ac73777	13-Apr-2018	Tycho Nightingale <tychon@FreeBSD.org>	Add SDT probes to vmexit on Intel. Submitted by: domagoj.stolfa_gmail.com Reviewed by: grehan, tychon Sponsored by: DARPA/AFRL Differential Revision: https://reviews.freebsd.org/D14656
# 01d822d3	08-Apr-2018	Rodney W. Grimes <rgrimes@FreeBSD.org>	Add the ability to control the CPU topology of created VMs from userland without the need to use sysctls. The old sysctls continue to function but are deprecated at FreeBSD_version 1200060 (Relnotes for deprecate). The command line of bhyve is maintained in a backwards compatible way. The API of libvmmapi is maintained in a backwards compatible way. The sysctls are maintained in a backwards compatible way. Added command option looks like: bhyve -c [[cpus=]n][,sockets=n][,cores=n][,threads=n][,maxcpus=n] The optional parts can be specified in any order, but only a single integer invokes the backwards compatible parse. [,maxcpus=n] is hidden by #ifdef until kernel support is added, though the API is put in place. bhyvectl --get-cpu-topology option added. Reviewed by: grehan (maintainer, earlier version), Reviewed by: bcr (manpages) Approved by: bde (mentor), phk (mentor) Tested by: Oleg Ginzburg <olevole@olevole.ru> (cbsd) MFC after: 1 week Relnotes: Y Differential Revision: https://reviews.freebsd.org/D9930
# fc276d92	06-Apr-2018	John Baldwin <jhb@FreeBSD.org>	Add a way to temporarily suspend and resume virtual CPUs. This is used as part of implementing run control in bhyve's debug server. The hypervisor now maintains a set of "debugged" CPUs. Attempting to run a debugged CPU will fail to execute any guest instructions and will instead report a VM_EXITCODE_DEBUG exit to the userland hypervisor. Virtual CPUs are placed into the debugged state via vm_suspend_cpu() (implemented via a new VM_SUSPEND_CPU ioctl). Virtual CPUs can be resumed via vm_resume_cpu() (VM_RESUME_CPU ioctl). The debug server suspends virtual CPUs when it wishes them to stop executing in the guest (for example, when a debugger attaches to the server). The debug server can choose to resume only a subset of CPUs (for example, when single stepping) or it can choose to resume all CPUs. The debug server must explicitly mark a CPU as resumed via vm_resume_cpu() before the virtual CPU will successfully execute any guest instructions. Reviewed by: avg, grehan Tested on: Intel (jhb), AMD (avg) Differential Revision: https://reviews.freebsd.org/D14466
# 490768e2	07-Mar-2018	Tycho Nightingale <tychon@FreeBSD.org>	Fix a lock recursion introduced in r327065. Reported by: kmacy Reviewed by: grehan, jhb Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14548
# 65eefbe4	17-Jan-2018	John Baldwin <jhb@FreeBSD.org>	Save and restore guest debug registers. Currently most of the debug registers are not saved and restored during VM transitions allowing guest and host debug register values to leak into the opposite context. One result is that hardware watchpoints do not work reliably within a guest under VT-x. Due to differences in SVM and VT-x, slightly different approaches are used. For VT-x: - Enable debug register save/restore for VM entry/exit in the VMCS for DR7 and MSR_DEBUGCTL. - Explicitly save DR0-3,6 of the guest. - Explicitly save DR0-3,6-7, MSR_DEBUGCTL, and the trap flag from %rflags for the host. Note that because DR6 is "software" managed and not stored in the VMCS a kernel debugger which single steps through VM entry could corrupt the guest DR6 (since a single step trap taken after loading the guest DR6 could alter the DR6 register). To avoid this, explicitly disable single-stepping via the trace flag before loading the guest DR6. A determined debugger could still defeat this by setting a breakpoint after the guest DR6 was loaded and then single-stepping. For SVM: - Enable debug register caching in the VMCB for DR6/DR7. - Explicitly save DR0-3 of the guest. - Explicitly save DR0-3,6-7, and MSR_DEBUGCTL for the host. Since SVM saves the guest DR6 in the VMCB, the race with single-stepping described for VT-x does not exist. For both platforms, expose all of the guest DRx values via --get-drX and --set-drX flags to bhyvectl. Discussed with: avg, grehan Tested by: avg (SVM), myself (VT-x) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D13229
# c49761dd	27-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys/amd64: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known open source licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts.
# edafb5a3	03-May-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	sys/amd64: Small spelling fixes. No functional change.
# 9b1aa8d6	18-Jun-2015	Neel Natu <neel@FreeBSD.org>	Restructure memory allocation in bhyve to support "devmem". devmem is used to represent MMIO devices like the boot ROM or a VESA framebuffer where doing a trap-and-emulate for every access is impractical. devmem is a hybrid of system memory (sysmem) and emulated device models. devmem is mapped in the guest address space via nested page tables similar to sysmem. However the address range where devmem is mapped may be changed by the guest at runtime (e.g. by reprogramming a PCI BAR). Also devmem is usually mapped RO or RW as compared to RWX mappings for sysmem. Each devmem segment is named (e.g. "bootrom") and this name is used to create a device node for the devmem segment (e.g. /dev/vmm/testvm.bootrom). The device node supports mmap(2) and this decouples the host mapping of devmem from its mapping in the guest address space (which can change). Reviewed by: tychon Discussed with: grehan Differential Revision: https://reviews.freebsd.org/D2762 MFC after: 4 weeks
# 248e6799	28-May-2015	Neel Natu <neel@FreeBSD.org>	Fix non-deterministic delays when accessing a vcpu that was in "running" or "sleeping" state. This is done by forcing the vcpu to transition to "idle" by returning to userspace with an exit code of VM_EXITCODE_REQIDLE. MFC after: 2 weeks
# ede04033	06-May-2015	Neel Natu <neel@FreeBSD.org>	Check 'td_owepreempt' and yield the vcpu thread if it is set. This is done explicitly because a vcpu thread can be in a critical section for the entire time slice allotted to it. This in turn can delay the handling of the 'td_owepreempt'. Reviewed by: jhb MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D2430
# 9c4d5478	06-May-2015	Neel Natu <neel@FreeBSD.org>	Deprecate the 3-way return values from vm_gla2gpa() and vm_copy_setup(). Prior to this change both functions returned 0 for success, -1 for failure and +1 to indicate that an exception was injected into the guest. The numerical value of ERESTART also happens to be -1 so when these functions returned -1 it had to be translated to a positive errno value to prevent the VM_RUN ioctl from being inadvertently restarted. This made it easy to introduce bugs when writing emulation code. Fix this by adding an 'int *guest_fault' parameter and setting it to '1' if an exception was delivered to the guest. The return value is 0 or EFAULT so no additional translation is needed. Reviewed by: tychon MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D2428
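The calling convention introduced by 9c4d5478 can be illustrated with a minimal sketch. It is not the vmm.ko implementation: `sketch_gla2gpa()`, `enum walk_result`, and the identity mapping are hypothetical stand-ins, and the sketch assumes that a fault delivered to the guest is reported through `*guest_fault` with a return value of 0.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome of a page-table walk, used only for this sketch. */
enum walk_result { WALK_OK, WALK_GUEST_FAULT, WALK_HOST_ERROR };

/*
 * Hypothetical translation routine. The return value is 0 or EFAULT and
 * *guest_fault is set to 1 when an exception was delivered to the guest,
 * so callers never see the ambiguous -1 (ERESTART) value.
 */
static int
sketch_gla2gpa(enum walk_result walk, uint64_t gla, uint64_t *gpa,
    int *guest_fault)
{
	assert(gpa != NULL && guest_fault != NULL);

	*guest_fault = 0;
	switch (walk) {
	case WALK_OK:
		*gpa = gla;		/* identity mapping, illustration only */
		return (0);
	case WALK_GUEST_FAULT:
		*guest_fault = 1;	/* exception injected into the guest */
		return (0);
	default:
		return (EFAULT);	/* host-side failure */
	}
}
```

A caller checks the return value first and then `*guest_fault`, and only uses the translated address when both are zero.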
# 8325ce5c	30-Apr-2015	Neel Natu <neel@FreeBSD.org>	Don't require <sys/cpuset.h> to be always included before <machine/vmm.h>. Only a subset of source files that include <machine/vmm.h> need to use the APIs that require the inclusion of <sys/cpuset.h>. MFC after: 1 week
# e4f605ee	24-Mar-2015	Tycho Nightingale <tychon@FreeBSD.org>	When fetching an instruction in non-64bit mode, consider the value of the code segment base address. Also if an instruction doesn't support a mod R/M (modRM) byte, don't be concerned if the CPU is in real mode. Reviewed by: neel
# 75346353	18-Jan-2015	Neel Natu <neel@FreeBSD.org>	MOVS instruction emulation. These instructions are emitted by 'bus_space_read_region()' when accessing MMIO regions. Since MOVS can be used with a repeat prefix start decoding the REPZ and REPNZ prefixes. Also start decoding the segment override prefix since MOVS allows overriding the source operand segment register. Tested by: tychon MFC after: 1 week
# c9c75df4	13-Jan-2015	Neel Natu <neel@FreeBSD.org>	'struct vm_exception' was intended to be used only as the collateral for the VM_INJECT_EXCEPTION ioctl. However it morphed into other uses like keeping track of pending exceptions for a vcpu. This in turn causes confusion because some fields in 'struct vm_exception' like 'vcpuid' make sense only in the ioctl context. It also makes it harder to add or remove structure fields. Fix this by using 'struct vm_exception' only to communicate information from userspace to vmm.ko when injecting an exception. Also, add a field 'restart_instruction' to 'struct vm_exception'. This field is set to '1' for exceptions where the faulting instruction is restarted after the exception is handled. MFC after: 1 week
# 0dafa5cd	30-Dec-2014	Neel Natu <neel@FreeBSD.org>	Replace bhyve's minimal RTC emulation with a fully featured one in vmm.ko. The new RTC emulation supports all interrupt modes: periodic, update ended and alarm. It is also capable of maintaining the date/time and NVRAM contents across virtual machine reset. Also, the date/time fields can now be modified by the guest. Since bhyve now emulates both the PIT and the RTC there is no need for "Legacy Replacement Routing" in the HPET so get rid of it. The RTC device state can be inspected via bhyvectl as follows: bhyvectl --vm=vm --get-rtc-time bhyvectl --vm=vm --set-rtc-time=<unix_time_secs> bhyvectl --vm=vm --rtc-nvram-offset=<offset> --get-rtc-nvram bhyvectl --vm=vm --rtc-nvram-offset=<offset> --set-rtc-nvram=<value> Reviewed by: tychon Discussed with: grehan Differential Revision: https://reviews.freebsd.org/D1385 MFC after: 2 weeks
# b0538143	22-Dec-2014	Neel Natu <neel@FreeBSD.org>	Allow ktr(4) tracing of all guest exceptions via the tunable "hw.vmm.trace_guest_exceptions". To enable this feature set the tunable to "1" before loading vmm.ko. Tracing the guest exceptions can be useful when debugging guest triple faults. Note that there is a performance impact when exception tracing is enabled since every exception will now trigger a VM-exit. Also, handle machine check exceptions that happen during guest execution by vectoring to the host's machine check handler via "int $18". Discussed with: grehan MFC after: 2 weeks
# 160ef77a	25-Oct-2014	Neel Natu <neel@FreeBSD.org>	Move the ACPI PM timer emulation into vmm.ko. This reduces variability during timer calibration by keeping the emulation "close" to the guest. Additionally having all timer emulations in the kernel will ease the transition to a per-VM clock source (as opposed to using the host's uptime to keep track of time). Discussed with: grehan
# 65145c7f	06-Oct-2014	Neel Natu <neel@FreeBSD.org>	Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT'. The hypervisor hides the MONITOR/MWAIT capability by unconditionally setting CPUID.01H:ECX[3] to 0 so the guest should not expect these instructions to be present anyway. Discussed with: grehan
# c3498942	19-Sep-2014	Neel Natu <neel@FreeBSD.org>	Restructure the MSR handling so it is entirely handled by processor-specific code. There are only a handful of MSRs common between the two so there isn't too much duplicate functionality. The VT-x code has the following types of MSRs: - MSRs that are unconditionally saved/restored on every guest/host context switch (e.g., MSR_GSBASE). - MSRs that are restored to guest values on entry to vmx_run() and saved before returning. This is an optimization for MSRs that are not used in host kernel context (e.g., MSR_KGSBASE). - MSRs that are emulated and every access by the guest causes a trap into the hypervisor (e.g., MSR_IA32_MISC_ENABLE). Reviewed by: grehan
# bbadcde4	13-Sep-2014	Neel Natu <neel@FreeBSD.org>	Set the 'vmexit->inst_length' field properly depending on the type of the VM-exit and ultimately on whether nRIP is valid. This allows us to update the %rip after the emulation is finished so any exceptions triggered during the emulation will point to the right instruction. Don't attempt to handle INS/OUTS VM-exits unless the DecodeAssist capability is available. The effective segment field in EXITINFO1 is not valid without this capability. Add VM_EXITCODE_SVM to flag SVM VM-exits that cannot be handled. Provide the VMCB fields exitinfo1 and exitinfo2 as collateral to help with debugging. Provide a SVM VM-exit handler to dump the exitcode, exitinfo1 and exitinfo2 fields in bhyve(8). Reviewed by: Anish Gupta (akgupt3@gmail.com) Reviewed by: grehan
# d1819632	12-Sep-2014	Neel Natu <neel@FreeBSD.org>	Optimize the common case of injecting an interrupt into a vcpu after a HLT by explicitly moving it out of the interrupt shadow. The hypervisor is done "executing" the HLT and by definition this moves the vcpu out of the 1-instruction interrupt shadow. Prior to this change the interrupt would be held pending because the VMCS guest-interruptibility-state would indicate that "blocking by STI" was in effect. This resulted in an unnecessary round trip into the guest before the pending interrupt could be injected. Reviewed by: grehan
# 7f21538b	23-Aug-2014	Peter Grehan <grehan@FreeBSD.org>	Change __inline style to be consistent with FreeBSD usage, and also fix gcc build (on STABLE, when MFCd). PR: 192880 Reviewed by: neel Reported by: ngie MFC after: 1 day
# f008d157	25-Jul-2014	Neel Natu <neel@FreeBSD.org>	If a vcpu has issued a HLT instruction with interrupts disabled then it sleeps forever in vm_handle_hlt(). This is usually not an issue as long as one of the other vcpus properly resets or powers off the virtual machine. However, if the bhyve(8) process is killed with a signal the halted vcpu cannot be woken up because its sleep cannot be interrupted. Fix this by waking up periodically and returning from vm_handle_hlt() if TDF_ASTPENDING is set. Reported by: Leon Dang Sponsored by: Nahanni Systems
# d37f2adb	23-Jul-2014	Neel Natu <neel@FreeBSD.org>	Fix fault injection in bhyve. The faulting instruction needs to be restarted when the exception handler is done handling the fault. bhyve now does this correctly by setting 'vmexit[vcpu].inst_length' to zero so the %rip is not advanced. A minor complication is that the fault injection APIs are used by instruction emulation code that is shared by vmm.ko and bhyve. Thus the argument that refers to 'struct vm *' in kernel or 'struct vmctx *' in userspace needs to be loosely typed as a 'void *'.
# d665d229	22-Jul-2014	Neel Natu <neel@FreeBSD.org>	Emulate instructions emitted by OpenBSD/i386 version 5.5: - CMP REG, r/m - MOV AX/EAX/RAX, moffset - MOV moffset, AX/EAX/RAX - PUSH r/m
# 091d4532	19-Jul-2014	Neel Natu <neel@FreeBSD.org>	Handle nested exceptions in bhyve. A nested exception condition arises when a second exception is triggered while delivering the first exception. Most nested exceptions can be handled serially but some are converted into a double fault. If an exception is generated during delivery of a double fault then the virtual machine shuts down as a result of a triple fault. vm_exit_intinfo() is used to record that a VM-exit happened while an event was being delivered through the IDT. If an exception is triggered while handling the VM-exit it will be treated like a nested exception. vm_entry_intinfo() is used by processor-specific code to get the event to be injected into the guest on the next VM-entry. This function is responsible for deciding the disposition of nested exceptions.
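The disposition of nested exceptions described in 091d4532 can be sketched as a small classifier. This is an illustration based on the architectural double-fault rules (contributory exceptions #DE, #TS, #NP, #SS, #GP, and the page fault #PF), not the logic of vm_entry_intinfo(); all names are hypothetical, and later additions to the architecture such as #VE are not modelled.

```c
#include <assert.h>

enum exc_class { EXC_BENIGN, EXC_CONTRIBUTORY, EXC_PAGEFAULT };
enum nested_action { HANDLE_SERIALLY, RAISE_DOUBLE_FAULT, TRIPLE_FAULT };

#define	VEC_DF	8	/* double fault vector */

/* Classify an x86 exception vector for double-fault purposes. */
static enum exc_class
classify(int vector)
{
	switch (vector) {
	case 0:		/* #DE */
	case 10:	/* #TS */
	case 11:	/* #NP */
	case 12:	/* #SS */
	case 13:	/* #GP */
		return (EXC_CONTRIBUTORY);
	case 14:	/* #PF */
		return (EXC_PAGEFAULT);
	default:
		return (EXC_BENIGN);
	}
}

/* Decide what happens when 'second' is raised while delivering 'first'. */
static enum nested_action
nested_disposition(int first, int second)
{
	enum exc_class c1, c2;

	assert(first >= 0 && first < 32 && second >= 0 && second < 32);

	/* An exception during delivery of a double fault shuts the VM down. */
	if (first == VEC_DF)
		return (TRIPLE_FAULT);

	c1 = classify(first);
	c2 = classify(second);
	if ((c1 == EXC_CONTRIBUTORY && c2 == EXC_CONTRIBUTORY) ||
	    (c1 == EXC_PAGEFAULT && c2 != EXC_BENIGN))
		return (RAISE_DOUBLE_FAULT);
	return (HANDLE_SERIALLY);
}
```

For example, a #GP raised while delivering a #PF becomes a double fault, whereas a #PF raised while delivering a #GP is handled serially.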
# 3d5444c8	16-Jul-2014	Neel Natu <neel@FreeBSD.org>	Add emulation for legacy x86 task switching mechanism. FreeBSD/i386 uses task switching to handle double fault exceptions and this change enables that to work. Reported by: glebius
# f7a9f178	15-Jul-2014	Neel Natu <neel@FreeBSD.org>	Add support for operand size and address size override prefixes in bhyve's instruction emulation [1]. Fix bug in emulation of opcode 0x8A where the destination is a legacy high byte register and the guest vcpu is in 32-bit mode. Prior to this change instead of modifying %ah, %bh, %ch or %dh the emulation would end up modifying %spl, %bpl, %sil or %dil instead. Add support for moffsets by treating it as a 2, 4 or 8 byte immediate value during instruction decoding. Fix bug in verify_gla() where the linear address computed after decoding the instruction was not being truncated to the effective address size [2]. Tested by: Leon Dang [1] Reported by: Peter Grehan [2] Sponsored by: Nahanni Systems
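Two of the fixes in f7a9f178 lend themselves to a short sketch: selecting the correct byte register for encodings 4-7, and truncating a computed linear address to the effective address size. The function names are hypothetical and the code is not taken from the instruction emulation; it only illustrates the x86 encoding rule that, without a REX prefix, encodings 4-7 name the legacy high byte registers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Name of the byte register selected by a 3-bit register encoding.
 * Without a REX prefix, encodings 4-7 are %ah, %ch, %dh, %bh; with a
 * REX prefix they are %spl, %bpl, %sil, %dil.
 */
static const char *
byte_reg_name(int reg, int rex_present)
{
	static const char *const low[] = { "al", "cl", "dl", "bl" };
	static const char *const high[] = { "ah", "ch", "dh", "bh" };
	static const char *const rexlow[] = { "spl", "bpl", "sil", "dil" };

	assert(reg >= 0 && reg <= 7);
	if (reg < 4)
		return (low[reg]);
	return (rex_present ? rexlow[reg - 4] : high[reg - 4]);
}

/* Truncate a linear address to an effective address size of 2, 4 or 8 bytes. */
static uint64_t
truncate_to_addrsize(uint64_t gla, int addrsize)
{
	assert(addrsize == 2 || addrsize == 4 || addrsize == 8);
	if (addrsize == 8)
		return (gla);
	return (gla & ((UINT64_C(1) << (addrsize * 8)) - 1));
}
```

A guest in 32-bit mode never supplies a REX prefix, so encoding 4 must resolve to %ah, and a 4-byte address size keeps only the low 32 bits of the computed address.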
# b301b9e2	08-Jul-2014	Neel Natu <neel@FreeBSD.org>	Accurately identify the vcpu's operating mode as 64-bit, compatibility, protected or real.
# 5ebc578b	10-Jun-2014	Tycho Nightingale <tychon@FreeBSD.org>	Replace enum forward declarations with complete definitions. Reviewed by: neel
# 40487465	10-Jun-2014	Neel Natu <neel@FreeBSD.org>	Add helper functions to populate VM exit information for rendezvous and astpending exits. This is to reduce code duplication between VT-x and SVM implementations.
# 5fcf252f	07-Jun-2014	Neel Natu <neel@FreeBSD.org>	Add ioctl(VM_REINIT) to reinitialize the virtual machine state maintained by vmm.ko. This allows the virtual machine to be restarted without having to destroy it first. Reviewed by: grehan
# 95ebc360	31-May-2014	Neel Natu <neel@FreeBSD.org>	Activate vcpus from bhyve(8) using the ioctl VM_ACTIVATE_CPU instead of doing it implicitly in vmm.ko. Add ioctl VM_GET_CPUS to get the current set of 'active' and 'suspended' cpus and display them via /usr/sbin/bhyvectl using the "--get-active-cpus" and "--get-suspended-cpus" options. This is in preparation for being able to reset virtual machine state without having to destroy and recreate it.
# 65ffa035	26-May-2014	Neel Natu <neel@FreeBSD.org>	Add segment protection and limits violation checks in vie_calculate_gla() for 32-bit x86 guests. Tested using ins/outs executed in a FreeBSD/i386 guest.
# 5382c19d	24-May-2014	Neel Natu <neel@FreeBSD.org>	Do the linear address calculation for the ins/outs emulation using a new API function 'vie_calculate_gla()'. While the current implementation is simplistic it forms the basis of doing segmentation checks if the guest is in 32-bit protected mode.
# da11f4aa	24-May-2014	Neel Natu <neel@FreeBSD.org>	Add libvmmapi functions vm_copyin() and vm_copyout() to copy into and out of the guest linear address space. These APIs in turn use a new ioctl 'VM_GLA2GPA' to convert the guest linear address to guest physical. Use the new copyin/copyout APIs when emulating ins/outs instruction in bhyve(8).
# e813a873	24-May-2014	Neel Natu <neel@FreeBSD.org>	Consolidate all the information needed by the guest page table walker into 'struct vm_guest_paging'. Check for canonical addressing in vmm_gla2gpa() and inject a protection fault into the guest if a violation is detected. If the page table walk is restarted in vmm_gla2gpa() then reset 'ptpphys' to point to the root of the page tables.
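The canonical addressing check mentioned in e813a873 can be sketched as follows. The helper name and the explicit width parameter are assumptions made for illustration; the sketch does not reproduce the check in vmm_gla2gpa(). An address is canonical when all bits above the implemented linear address width are copies of the highest implemented bit.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 1 if 'gla' is canonical for a linear address width of 'bits'
 * (48 with 4-level paging), 0 otherwise.
 */
static int
is_canonical(uint64_t gla, int bits)
{
	uint64_t upper, mask;

	assert(bits > 1 && bits < 64);

	/* Bits 63:(bits-1) must be all zeros or all ones. */
	upper = gla >> (bits - 1);
	mask = (UINT64_C(1) << (64 - bits + 1)) - 1;
	return (upper == 0 || upper == mask);
}
```

With a 48-bit width, 0x00007fffffffffff and 0xffff800000000000 pass the check, while 0x0000800000000000 fails and would call for a fault to be injected into the guest.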
# 37a723a5	24-May-2014	Neel Natu <neel@FreeBSD.org>	When injecting a page fault into the guest also update the guest's %cr2 to indicate the faulting linear address. If the guest PML4 entry has the PG_PS bit set then inject a page fault into the guest with the PGEX_RSV bit set in the error_code. Get rid of redundant checks for the PG_RW violations when walking the page tables.
# d17b5104	22-May-2014	Neel Natu <neel@FreeBSD.org>	Add emulation of the "outsb" instruction. NetBSD guests use this to write to the UART FIFO. The emulation is constrained in a number of ways: 64-bit only, doesn't check for all exception conditions, limited to i/o ports emulated in userspace. Some of these constraints will be relaxed in followup commits. Requested by: grehan Reviewed by: tychon (partially and a much earlier version)
# fd949af6	21-May-2014	Neel Natu <neel@FreeBSD.org>	Inject page fault into the guest if the page table walker detects an invalid translation for the guest linear address.
# e4c8a13d	18-May-2014	Neel Natu <neel@FreeBSD.org>	Add PG_U (user/supervisor) checks when translating a guest linear address to a guest physical address. PG_PS (page size) field is valid only in a PDE or a PDPTE so it is now checked only in non-terminal paging entries. Ignore the upper 32-bits of the CR3 for PAE paging.
# b3e9732a	15-May-2014	John Baldwin <jhb@FreeBSD.org>	Implement a PCI interrupt router to route PCI legacy INTx interrupts to the legacy 8259A PICs. - Implement an ICH-compatible PCI interrupt router on the lpc device with 8 steerable pins configured via config space access to byte-wide registers at 0x60-63 and 0x68-6b. - For each configured PCI INTx interrupt, route it to both an I/O APIC pin and a PCI interrupt router pin. When a PCI INTx interrupt is asserted, ensure that both pins are asserted. - Provide an initial routing of PCI interrupt router (PIRQ) pins to 8259A pins (ISA IRQs) and initialize the interrupt line config register for the corresponding PCI function with the ISA IRQ as this matches existing hardware. - Add a global _PIC method for OSPM to select the desired interrupt routing configuration. - Update the _PRT methods for PCI bridges to provide both APIC and legacy PRT tables and return the appropriate table based on the configured routing configuration. Note that if the lpc device is not configured, no routing information is provided. - When the lpc device is enabled, provide ACPI PCI link devices corresponding to each PIRQ pin. - Add a VMM ioctl to adjust the trigger mode (edge vs level) for 8259A pins via the ELCR. - Mark the power management SCI as level triggered. - Don't hardcode the number of elements in Packages in the source for the DSDT. iasl(8) will fill in the actual number of elements, and this makes it simpler to generate a Package with a variable number of elements. Reviewed by: tycho
# e50ce2aa	01-May-2014	Neel Natu <neel@FreeBSD.org>	Add logic in the HLT exit handler to detect if the guest has put all vcpus to sleep permanently by executing a HLT with interrupts disabled. When this condition is detected the guest will be suspended with a reason of VM_SUSPEND_HALT and the bhyve(8) process will exit. Tested by executing "halt" inside a RHEL7-beta guest. Discussed with: grehan@ Reviewed by: jhb@, tychon@
# c6a0cc2e	29-Apr-2014	Neel Natu <neel@FreeBSD.org>	Some Linux guests will implement a 'halt' by disabling the APIC and executing the 'HLT' instruction. This condition was detected by 'vm_handle_hlt()' and converted into the SPINDOWN_CPU exitcode. The bhyve(8) process would exit the vcpu thread in response to a SPINDOWN_CPU and when the last vcpu was spun down it would reset the virtual machine via vm_suspend(VM_SUSPEND_RESET). This functionality was broken in r263780 in a way that made it impossible to kill the bhyve(8) process because it would loop forever in vm_handle_suspend(). Unbreak this by removing the code to spindown vcpus. Thus a 'halt' from a Linux guest will appear to be hung but this is consistent with the behavior on bare metal. The guest can be rebooted by using the bhyvectl options '--force-reset' or '--force-poweroff'. Reviewed by: grehan@
# f0fdcfe2	28-Apr-2014	Neel Natu <neel@FreeBSD.org>	Allow a virtual machine to be forcibly reset or powered off. This is done by adding an argument to the VM_SUSPEND ioctl that specifies how the virtual machine should be suspended, viz. VM_SUSPEND_RESET or VM_SUSPEND_POWEROFF. The disposition of VM_SUSPEND is also made available to the exit handler via the 'u.suspended' member of 'struct vm_exit'. This capability is exposed via the '--force-reset' and '--force-poweroff' arguments to /usr/sbin/bhyvectl. Discussed with: grehan@
# b15a09c0	26-Mar-2014	Neel Natu <neel@FreeBSD.org>	Add an ioctl to suspend a virtual machine (VM_SUSPEND). The ioctl can be called from any context i.e., it is not required to be called from a vcpu thread. The ioctl simply sets a state variable 'vm->suspend' to '1' and returns. The vcpus inspect 'vm->suspend' in the run loop and if it is set to '1' the vcpu breaks out of the loop with a reason of 'VM_EXITCODE_SUSPENDED'. The suspend handler waits until all 'vm->active_cpus' have transitioned to 'vm->suspended_cpus' before returning to userspace. Discussed with: grehan
# e883c9bb	25-Mar-2014	Tycho Nightingale <tychon@FreeBSD.org>	Move the atpit device model from userspace into vmm.ko for better precision and lower latency. Approved by: grehan (co-mentor)
# 0775fbb4	15-Mar-2014	Tycho Nightingale <tychon@FreeBSD.org>	Fix a race wherein the source of an interrupt vector is wrongly attributed if an ExtINT arrives during interrupt injection. Also, fix a spurious interrupt if the PIC tries to raise an interrupt before the outstanding one is accepted. Finally, improve the PIC interrupt latency when another interrupt is raised immediately after the outstanding one is accepted by creating a vmexit rather than waiting for one to occur by happenstance. Approved by: neel (co-mentor)
# 762fd208	11-Mar-2014	Tycho Nightingale <tychon@FreeBSD.org>	Replace the userspace atpic stub with a more functional vmm.ko model. New ioctls VM_ISA_ASSERT_IRQ, VM_ISA_DEASSERT_IRQ and VM_ISA_PULSE_IRQ can be used to manipulate the pic, and optionally the ioapic, pin state. Reviewed by: jhb, neel Approved by: neel (co-mentor)
# dc506506	25-Feb-2014	Neel Natu <neel@FreeBSD.org>	Queue pending exceptions in the 'struct vcpu' instead of directly updating the processor-specific VMCS or VMCB. The pending exception will be delivered right before entering the guest. The order of event injection into the guest is: - hardware exception - NMI - maskable interrupt In the Intel VT-x case, a pending NMI or interrupt will enable the interrupt window-exiting and inject it as soon as possible after the hardware exception is injected. Also since interrupts are inherently asynchronous, injecting them after the hardware exception should not affect correctness from the guest perspective. Rename the unused ioctl VM_INJECT_EVENT to VM_INJECT_EXCEPTION and restrict it to only deliver x86 hardware exceptions. This new ioctl is now used to inject a protection fault when the guest accesses an unimplemented MSR. Discussed with: grehan, jhb Reviewed by: jhb
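The event injection order stated in dc506506 (hardware exception, then NMI, then maskable interrupt) amounts to a fixed-priority selection. The sketch below uses a hypothetical `struct pending_events` and `next_event()`; the real state lives in 'struct vcpu' and in processor-specific code, which this does not attempt to reproduce.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-vcpu pending event flags, for illustration only. */
struct pending_events {
	int exception;	/* a hardware exception is queued */
	int nmi;	/* an NMI is pending */
	int interrupt;	/* a maskable interrupt is pending */
};

enum event { EV_NONE, EV_EXCEPTION, EV_NMI, EV_INTERRUPT };

/* Pick the next event to inject right before entering the guest. */
static enum event
next_event(const struct pending_events *p)
{
	assert(p != NULL);

	if (p->exception)
		return (EV_EXCEPTION);
	if (p->nmi)
		return (EV_NMI);
	if (p->interrupt)
		return (EV_INTERRUPT);
	return (EV_NONE);
}
```

Because interrupts are asynchronous, deferring them until after the exception has been injected does not change what the guest can observe.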
# 52e5c8a2	19-Feb-2014	Neel Natu <neel@FreeBSD.org>	Simplify APIC mode switching from MMIO to x2APIC. In part this is done to simplify the implementation of the x2APIC virtualization assist in VT-x. Prior to this change the vlapic allowed the guest to change its mode from xAPIC to x2APIC. We don't allow that any more and the vlapic mode is locked when the virtual machine is created. This is not very constraining because operating systems already have to deal with BIOS setting up the APIC in x2APIC mode at boot. Fix a bug in the CPUID emulation where the x2APIC capability was leaking from the host to the guest. Ignore MMIO reads and writes to the vlapic in x2APIC mode. Similarly, ignore MSR accesses to the vlapic when it is in xAPIC mode. The default configuration of the vlapic is xAPIC. The "-x" option to bhyve(8) can be used to change the mode to x2APIC instead. Discussed with: grehan@
# 00f3efe1	04-Feb-2014	John Baldwin <jhb@FreeBSD.org>	Add support for FreeBSD/i386 guests under bhyve. - Similar to the hack for bootinfo32.c in userboot, define _MACHINE_ELF_WANT_32BIT in the load_elf32 file handlers in userboot. This allows userboot to load 32-bit kernels and modules. - Copy the SMAP generation code out of bootinfo64.c and into its own file so it can be shared with bootinfo32.c to pass an SMAP to the i386 kernel. - Use uint32_t instead of u_long when aligning module metadata in bootinfo32.c in userboot, as otherwise the metadata used 64-bit alignment which corrupted the layout. - Populate the basemem and extmem members of the bootinfo struct passed to 32-bit kernels. - Fix the 32-bit stack in userboot to start at the top of the stack instead of the bottom so that there is room to grow before the kernel switches to its own stack. - Push a fake return address onto the 32-bit stack in addition to the arguments normally passed to exec() in the loader. This return address is needed to convince recover_bootinfo() in the 32-bit locore code that it is being invoked from a "new" boot block. - Add a routine to libvmmapi to setup a 32-bit flat mode register state including a GDT and TSS that is able to start the i386 kernel and update bhyveload to use it when booting an i386 kernel. - Use the guest register state to determine the CPU's current instruction mode (32-bit vs 64-bit) and paging mode (flat, 32-bit, PAE, or long mode) in the instruction emulation code. Update the gla2gpa() routine used when fetching instructions to handle flat mode, 32-bit paging, and PAE paging in addition to long mode paging. Don't look for a REX prefix when the CPU is in 32-bit mode, and use the detected mode to enable the existing 32-bit mode code when decoding the mod r/m byte. Reviewed by: grehan, neel MFC after: 1 month
# 30b94db8	25-Jan-2014	Neel Natu <neel@FreeBSD.org>	Support level triggered interrupts with VT-x virtual interrupt delivery. The VMCS field EOI_bitmap[] is an array of 256 bits - one for each vector. If a bit is set to '1' in the EOI_bitmap[] then the processor will trigger an EOI-induced VM-exit when it is doing EOI virtualization. The EOI-induced VM-exit results in the EOI being forwarded to the vioapic so that level triggered interrupts can be properly handled. Tested by: Anish Gupta (akgupt3@gmail.com)
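The 256-bit EOI_bitmap[] described in 30b94db8 has one bit per interrupt vector. A minimal sketch of such a bitmap, stored as four 64-bit words, is shown below; the structure and helper names are hypothetical and do not reflect the VMCS field layout or the vmm.ko code.

```c
#include <assert.h>
#include <stdint.h>

/* 256 bits, one per vector, stored as four 64-bit words. */
struct eoi_bitmap {
	uint64_t word[4];
};

/* Request an EOI-induced VM-exit for 'vector' (level-triggered). */
static void
eoi_bitmap_set(struct eoi_bitmap *b, int vector)
{
	assert(vector >= 0 && vector < 256);
	b->word[vector / 64] |= UINT64_C(1) << (vector % 64);
}

/* Return 1 if an EOI for 'vector' should be forwarded to the vioapic. */
static int
eoi_bitmap_isset(const struct eoi_bitmap *b, int vector)
{
	assert(vector >= 0 && vector < 256);
	return ((b->word[vector / 64] >> (vector % 64)) & 1);
}
```

Vectors whose bit is clear are treated as edge-triggered in this sketch, so their EOI does not need to reach the vioapic.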
# 5b8a8cd1	13-Jan-2014	Neel Natu <neel@FreeBSD.org>	Add an API to rendezvous all active vcpus in a virtual machine. The rendezvous can be initiated in the context of a vcpu thread or from the bhyve(8) control process. The first use of this functionality is to update the vlapic trigger-mode register when the IOAPIC pin configuration is changed. Prior to this change we would update the TMR in the virtual-APIC page at the time of interrupt delivery. But this doesn't work with Posted Interrupts because there is no way to program the EOI_exit_bitmap[] in the VMCS of the target at the time of interrupt delivery. Discussed with: grehan@
# add611fd	08-Jan-2014	Neel Natu <neel@FreeBSD.org>	Don't expose 'vmm_ipinum' as a global.
# 0492757c	01-Jan-2014	Neel Natu <neel@FreeBSD.org>	Restructure the VMX code to enter and exit the guest. In large part this change hides the setjmp/longjmp semantics of VM enter/exit. vmx_enter_guest() is used to enter guest context and vmx_exit_guest() is used to transition back into host context. Fix a longstanding race where a vcpu interrupt notification might be ignored if it happens after vmx_inject_interrupts() but before host interrupts are disabled in vmx_resume/vmx_launch. We now call vmx_inject_interrupts() with host interrupts disabled to prevent this. Suggested by: grehan@
# de5ea6b6	24-Dec-2013	Neel Natu <neel@FreeBSD.org>	vlapic code restructuring to make it easy to support hardware-assist for APIC emulation. The vlapic initialization and cleanup is done via processor specific vmm_ops. This will allow the VT-x/SVM modules to layer any hardware-assist for APIC emulation or virtual interrupt delivery on top of the vlapic device model. Add a parameter to 'vcpu_notify_event()' to distinguish between vlapic interrupts versus other events (e.g. NMI). This provides an opportunity to use hardware-assists like Posted Interrupts (VT-x) or doorbell MSR (SVM) to deliver an interrupt to a guest without causing a VM-exit. Get rid of lapic_pending_intr() and lapic_intr_accepted() and use the vlapic_xxx() counterparts directly. Associate an 'Apic Page' with each vcpu and reference it from the 'vlapic'. The 'Apic Page' is intended to be referenced from the Intel VMCS as the 'virtual APIC page' or from the AMD VMCB as the 'vAPIC backing page'.
# 63e62d39	23-Dec-2013	John Baldwin <jhb@FreeBSD.org>	Add a resume hook for bhyve that runs a function on all CPUs during resume. For Intel CPUs, invoke vmxon for CPUs that were in VMX mode at the time of suspend. Reviewed by: neel
# f80330a8	22-Dec-2013	Neel Natu <neel@FreeBSD.org>	Add a parameter to 'vcpu_set_state()' to enforce that the vcpu is in the IDLE state before the requested state transition. This guarantees that there is exactly one ioctl() operating on a vcpu at any point in time and prevents unintended state transitions. More details available here: http://lists.freebsd.org/pipermail/freebsd-virtualization/2013-December/001825.html Reviewed by: grehan Reported by: Markiyan Kushnir (markiyan.kushnir at gmail.com) MFC after: 3 days
# 1c052192	07-Dec-2013	Neel Natu <neel@FreeBSD.org>	If a vcpu disables its local apic and then executes a 'HLT' then spin down the vcpu and destroy its thread context. Also modify the 'HLT' processing to ignore pending interrupts in the IRR if interrupts have been disabled by the guest. The interrupt cannot be injected into the guest in any case so resuming it is futile. With this change "halt" from a Linux guest works correctly. Reviewed by: grehan@ Tested by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
# 7a3c80aa	02-Dec-2013	Neel Natu <neel@FreeBSD.org>	The 'protection' field in the VM exit collateral for the PAGING exit is not used - get rid of it.
# 22821874	02-Dec-2013	Neel Natu <neel@FreeBSD.org>	Rename 'vm_interrupt_hostcpu()' to 'vcpu_notify_event()' because the function has outgrown its original name. Originally this function simply sent an IPI to the host cpu that a vcpu was executing on but now it does a lot more than just that. Reviewed by: grehan@
# 08e3ff32	25-Nov-2013	Neel Natu <neel@FreeBSD.org>	Add HPET device emulation to bhyve. bhyve supports a single timer block with 8 timers. The timers are all 32-bit and capable of being operated in periodic mode. All timers support interrupt delivery using MSI. Timers 0 and 1 also support legacy interrupt routing. At the moment the timers are not connected to any ioapic pins but that will be addressed in a subsequent commit. This change is based on a patch from Tycho Nightingale (tycho.nightingale@pluribusnetworks.com).
# 565bbb86	12-Nov-2013	Neel Natu <neel@FreeBSD.org>	Move the ioapic device model from userspace into vmm.ko. This is needed for upcoming in-kernel device emulations like the HPET. The ioctls VM_IOAPIC_ASSERT_IRQ and VM_IOAPIC_DEASSERT_IRQ are used to manipulate the ioapic pin state. Discussed with: grehan@ Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
# 49cc03da	16-Oct-2013	Neel Natu <neel@FreeBSD.org>	Add a new capability, VM_CAP_ENABLE_INVPCID, that can be enabled to expose 'invpcid' instruction to the guest. Currently bhyve will try to enable this capability unconditionally if it is available. Consolidate code in bhyve to set the capabilities so it is no longer duplicated in BSP and AP bringup. Add a sysctl 'vm.pmap.invpcid_works' to display whether the 'invpcid' instruction is available. Reviewed by: grehan MFC after: 3 days
# 318224bb	05-Oct-2013	Neel Natu <neel@FreeBSD.org>	Merge projects/bhyve_npt_pmap into head. Make the amd64/pmap code aware of nested page table mappings used by bhyve guests. This allows bhyve to associate each guest with its own vmspace and deal with nested page faults in the context of that vmspace. This also enables features like accessed/dirty bit tracking, swapping to disk and transparent superpage promotions of guest memory. Guest vmspace: Each bhyve guest has a unique vmspace to represent the physical memory allocated to the guest. Each memory segment allocated by the guest is mapped into the guest's address space via the 'vmspace->vm_map' and is backed by an object of type OBJT_DEFAULT. pmap types: The amd64/pmap now understands two types of pmaps: PT_X86 and PT_EPT. The PT_X86 pmap type is used by the vmspace associated with the host kernel as well as user processes executing on the host. The PT_EPT pmap is used by the vmspace associated with a bhyve guest. Page Table Entries: The EPT page table entries are mostly similar in functionality to regular page table entries although there are some differences in terms of what bits are used to express that functionality. For example, the dirty bit is represented by bit 9 in the nested PTE as opposed to bit 6 in the regular x86 PTE. Therefore the bitmask representing the dirty bit is now computed at runtime based on the type of the pmap. Thus PG_M that was previously a macro now becomes a local variable that is initialized at runtime using 'pmap_modified_bit(pmap)'. An additional wrinkle associated with EPT mappings is that older Intel processors don't have hardware support for tracking accessed/dirty bits in the PTE. This means that the amd64/pmap code needs to emulate these bits to provide proper accounting to the VM subsystem.
This is achieved by using the following mapping for EPT entries that need emulation of A/D bits:
Bit    Position  Interpreted By
PG_V   52        software (accessed bit emulation handler)
PG_RW  53        software (dirty bit emulation handler)
PG_A   0         hardware (aka EPT_PG_RD)
PG_M   1         hardware (aka EPT_PG_WR)
The idea to use the mapping listed above for A/D bit emulation came from Alan Cox (alc@). The final difference with respect to x86 PTEs is that some EPT implementations do not support superpage mappings. This is recorded in the 'pm_flags' field of the pmap. TLB invalidation: The amd64/pmap code has a number of ways to do invalidation of mappings that may be cached in the TLB: single page, multiple pages in a range or the entire TLB. All of these funnel into a single EPT invalidation routine called 'pmap_invalidate_ept()'. This routine bumps up the EPT generation number and sends an IPI to the host cpus that are executing the guest's vcpus. On a subsequent entry into the guest it will detect that the EPT has changed and invalidate the mappings from the TLB. Guest memory access: Since the guest memory is no longer wired we need to hold the host physical page that backs the guest physical page before we can access it. The helper functions 'vm_gpa_hold()/vm_gpa_release()' are available for this purpose. PCI passthru: Guests with PCI passthru devices will wire the entire guest physical address space. The MMIO BAR associated with the passthru device is backed by a vm_object of type OBJT_SG. An IOMMU domain is created only for guests that have one or more PCI passthru devices attached to them. Limitations: There isn't a way to map a guest physical page without execute permissions. This is because the amd64/pmap code interprets the guest physical mappings as user mappings since they are numerically below VM_MAXUSER_ADDRESS. Since PG_U shares the same bit position as EPT_PG_EXECUTE all guest mappings become automatically executable. 
Thanks to Alan Cox and Konstantin Belousov for their rigorous code reviews as well as their support and encouragement. Thanks to John Baldwin for reviewing the use of OBJT_SG as the backing object for pci passthru mmio regions. Special thanks to Peter Holm for testing the patch on short notice. Approved by: re Discussed with: grehan Reviewed by: alc, kib Tested by: pho
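The runtime selection of the dirty-bit mask described in the entry above (318224bb) can be illustrated with a minimal sketch. This is not the amd64/pmap source: every identifier prefixed with `sketch_` or `SKETCH_` is hypothetical, and only the bit positions (6 for the regular x86 PTE, 9 for the nested PTE, 1 when A/D bits are emulated) are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two pmap types named in the commit message. */
enum sketch_pmap_type { SKETCH_PT_X86, SKETCH_PT_EPT };

struct sketch_pmap {
	enum sketch_pmap_type type;
	bool emulate_ad_bits;	/* assumed flag: CPU lacks hardware A/D tracking */
};

/* Bit positions as stated in the commit message. */
#define SKETCH_X86_PG_M		(UINT64_C(1) << 6)	/* dirty bit, regular x86 PTE */
#define SKETCH_EPT_PG_M		(UINT64_C(1) << 9)	/* dirty bit, nested PTE */
#define SKETCH_EPT_PG_WR	(UINT64_C(1) << 1)	/* PG_M position under A/D emulation */

/* Return the dirty-bit mask that applies to this pmap. */
uint64_t
sketch_modified_bit(const struct sketch_pmap *pmap)
{
	switch (pmap->type) {
	case SKETCH_PT_X86:
		return (SKETCH_X86_PG_M);
	case SKETCH_PT_EPT:
		return (pmap->emulate_ad_bits ?
		    SKETCH_EPT_PG_WR : SKETCH_EPT_PG_M);
	}
	assert(0 && "unknown pmap type");
	return (0);
}

/* PG_M is a local variable computed per pmap rather than a global macro. */
bool
sketch_pte_is_dirty(const struct sketch_pmap *pmap, uint64_t pte)
{
	uint64_t pg_m = sketch_modified_bit(pmap);

	return ((pte & pg_m) != 0);
}
```

With this shape, the same PTE value can read as dirty under one pmap type and clean under another, which is why the mask has to be looked up from the pmap at each use.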
# 8d39ed16	09-Sep-2013	Peter Grehan <grehan@FreeBSD.org>	Go way past 11 and bump bhyve's max vCPUs to 16. This should be sufficient for 10.0 and will do until forthcoming work to avoid limitations in this area is complete. Thanks to Bela Lubkin at tidalscale for the heads-up on the apic/cpu id/io apic ASL parameters that are actually hex values and broke when written as decimal when 11 vCPUs were configured. Approved by: re@
# d3c11f40	24-Apr-2013	Peter Grehan <grehan@FreeBSD.org>	Add RIP-relative addressing to the instruction decoder. Rework the guest register fetch code to allow the RIP to be extracted from the VMCS while the kernel decoder is functioning. Hit by the OpenBSD local-apic code. Submitted by: neel Reviewed by: grehan Obtained from: NetApp
# d5408b1d	11-Apr-2013	Neel Natu <neel@FreeBSD.org>	If vmm.ko could not be initialized correctly then prevent the creation of virtual machines subsequently. Submitted by: Chris Torek
# 485b3300	11-Feb-2013	Neel Natu <neel@FreeBSD.org>	Implement guest vcpu pinning using 'pthread_setaffinity_np(3)'. Prior to this change pinning was implemented via an ioctl (VM_SET_PINNING) that called 'sched_bind()' on behalf of the user thread. The ULE implementation of 'sched_bind()' bumps up 'td_pinned' which in turn runs afoul of the assertion '(td_pinned == 0)' in userret(). Using the cpuset affinity to implement pinning of the vcpu threads works with both 4BSD and ULE schedulers and has the happy side-effect of getting rid of a bunch of code in vmm.ko. Discussed with: grehan
# 912a3e67	19-Jan-2013	Neel Natu <neel@FreeBSD.org>	Add svn properties to the recently merged bhyve source files. The pre-commit hook will not allow any commits without the svn:keywords property in head.
# 48a29f4e	28-Nov-2012	Neel Natu <neel@FreeBSD.org>	Cleanup the user-space paging exit handler now that the unified instruction emulation is in place. Obtained from: NetApp
# ba9b7bf7	27-Nov-2012	Neel Natu <neel@FreeBSD.org>	Revamp the x86 instruction emulation in bhyve. On a nested page table fault the hypervisor will: - fetch the instruction using the guest %rip and %cr3 - decode the instruction in 'struct vie' - emulate the instruction in host kernel context for local apic accesses - any other type of mmio access is punted up to user-space (e.g. ioapic) The decoded instruction is passed as collateral to the user-space process that is handling the PAGING exit. The emulation code is fleshed out to include more addressing modes (e.g. SIB) and more types of operands (e.g. imm8). The source code is unified into a single file (vmm_instruction_emul.c) that is compiled into vmm.ko as well as /usr/sbin/bhyve. Reviewed by: grehan Obtained from: NetApp
# f352ff0c	23-Oct-2012	Neel Natu <neel@FreeBSD.org>	Maintain state regarding NMI delivery to guest vcpu in VT-x independent manner. Also add a stats counter to count the number of NMIs delivered per vcpu. Obtained from: NetApp
# 13ec9371	12-Oct-2012	Peter Grehan <grehan@FreeBSD.org>	Add the guest physical address and r/w/x bits to the paging exit in preparation for a rework of bhyve MMIO handling. Reviewed by: neel Obtained from: NetApp
# 75dd3366	12-Oct-2012	Neel Natu <neel@FreeBSD.org>	Provide per-vcpu locks instead of relying on a single big lock. This also gets rid of all the witness.watch warnings related to calling malloc(M_WAITOK) while holding a mutex. Reviewed by: grehan
# bda273f2	02-Oct-2012	Neel Natu <neel@FreeBSD.org>	Get rid of assumptions in the hypervisor that the host physical memory associated with guest physical memory is contiguous. Rewrite vm_gpa2hpa() to get the GPA to HPA mapping by querying the nested page tables.
# 341f19c9	28-Sep-2012	Neel Natu <neel@FreeBSD.org>	Get rid of assumptions in the hypervisor that the host physical memory associated with guest physical memory is contiguous. In this case vm_malloc() was using vm_gpa2hpa() to indirectly infer whether or not the address range had already been allocated. Replace this instead with an explicit API 'vm_gpa_available()' that returns TRUE if a page is available for allocation in guest physical address space.
# e9027382	25-Sep-2012	Neel Natu <neel@FreeBSD.org>	Add ioctls to control the X2APIC capability exposed by the virtual machine to the guest. At the moment this simply sets the state in the 'vcpu' instance but there is no code that acts upon these settings.
# edf89256	24-Sep-2012	Neel Natu <neel@FreeBSD.org>	Add an explicit exit code 'SPINUP_AP' to tell the controlling process that an AP needs to be activated by spinning up an execution context for it. The local apic emulation is now completely done in the hypervisor and it will detect writes to the ICR_LO register that try to bring up the AP. In response to such writes it will return to userspace with an exit code of SPINUP_AP. Reviewed by: grehan
# 98ed632c	24-Sep-2012	Neel Natu <neel@FreeBSD.org>	Stash the 'vm_exit' information in each 'struct vcpu'. There is no functional change at this time but this paves the way for vm exit handler functions to easily modify the exit reason going forward.
# cd942e0f	28-Apr-2012	Peter Grehan <grehan@FreeBSD.org>	MSI-x interrupt support for PCI pass-thru devices. Includes instruction emulation for memory r/w access. This opens the door for io-apic, local apic, hpet timer, and legacy device emulation. Submitted by: ryan dot berryhill at sandvine dot com Reviewed by: grehan Obtained from: Sandvine
# 366f6083	12-May-2011	Peter Grehan <grehan@FreeBSD.org>	Import of bhyve hypervisor and utilities, part 1. vmm.ko: kernel module for VT-x, VT-d and hypervisor control; bhyve: user-space sequencer and i/o emulation; vmmctl: dump of hypervisor register state; libvmm: front-end to vmm.ko chardev interface. bhyve was designed and implemented by Neel Natu. Thanks to the following folk from NetApp who helped to make this available: Joe CaraDonna, Peter Snyder, Jeff Heller, Sandeep Mann, Steve Miller, Brian Pawlowski